Merge branch 'master' of github.com:jittor/jittor

li-xl 2021-11-02 17:45:34 +08:00
commit d08afa840d
91 changed files with 9883 additions and 251 deletions

AWESOME-JITTOR-LIST.md (new file, 28 lines)

@ -0,0 +1,28 @@
# Awesome Jittor List
- [GAN](https://github.com/Jittor/gan-jittor): The Jittor GAN model zoo includes the 27 most popular GAN models from 2014 to 2019. Together these 27 GANs have been cited 60,953 times, an average of 2,176 citations per paper. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/resources/jittor-gan/gan-all.png)
- [Instance Segmentation](https://github.com/Jittor/InstanceSegmentation-jittor): The Jittor instance segmentation model zoo contains 6 backbones and 11 detection/segmentation models, including the classic Mask R-CNN family, real-time instance segmentation networks, human segmentation networks, and more. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/resources/jittor-seg/fenge.png)
- [Semantic Segmentation](https://github.com/Jittor/segmentation-jittor): Jittor already supports the mainstream semantic segmentation algorithms, including three classic backbones and six common segmentation models. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/resources/jittor-is/fenge.png)
- [Point Cloud](https://github.com/Jittor/PointCloudLib): The point cloud model zoo released with the Jittor framework includes several of the most influential models (PointNet, PointNet++, PointCNN, DGCNN, and PointConv) and supports both classification and segmentation. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/resources/jittor-point/dianyun.png)
- [Differentiable Rendering](https://github.com/Jittor/jrender): Differentiable rendering is widely used in 3D reconstruction, as well as in human body reconstruction, face reconstruction, and 3D attribute estimation. Jrender currently supports differentiable surface rendering and differentiable volume rendering. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/tutorial/2020-10-17-22-00-dr/dr.png)
- [Remote Sensing Detection](https://github.com/Jittor/JDet): JDet is a remote-sensing object detection library based on Jittor. It currently provides four mainstream remote-sensing detectors (S2ANet, Gliding, RetinaNet, and Faster R-CNN), and other mainstream models are being added. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images//download/jdet.png)
- [Medical Image Segmentation](https://github.com/THU-CVlab/JMedSeg): Jittor Medical Segmentation Lib -- the course assignment of the Pattern Recognition course (Spring 2021) at Tsinghua University. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/download/JMedSeg-0.jpg)
- [PCT](https://github.com/MenghaoGuo/PCT): This is a Jittor implementation of PCT: Point Cloud Transformer. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/download/pct-0.jpg)
- [DeepFaceDrawing](https://github.com/IGLICT/DeepFaceDrawing-Jittor): One version of our system is implemented using Jittor, so you need to install Jittor first. A PyTorch version will also be provided. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/papers/2020-9-10-DeepFaceDrawing.jpg)
- [JittorVis](https://github.com/thu-vis/JittorVis): JittorVis is an open-source library for understanding the inner workings of Jittor models by visually illustrating their dataflow graphs. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/download/jittorvis.png)
- [PraNet](https://github.com/DengPingFan/PraNet): Parallel Reverse Attention Network for Polyp Segmentation. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/download/PraNet.png)
- [DeepFaceEditing](https://github.com/IGLICT/DeepFaceEditing-Jittor): Deep Face Generation and Editing with Disentangled Geometry and Appearance Control. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/images/download/deepfaceediting-0.jpg)
- [SINet-V2](https://github.com/GewelsJI/SINet-V2): Concealed Object Detection (SINet-V2, IEEE TPAMI 2021). Code using the Jittor framework is available. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/SINet-V2.png)
- [hlagcn](https://github.com/shedy-pub/hlagcn-jittor): Jittor implementation of the paper "Hierarchical Layout-Aware Graph Convolutional Network for Unified Aesthetics Assessment". [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/hlagcn-jittor.jpg)
- [Jittor-Image-Models](https://github.com/Jittor-Image-Models/Jittor-Image-Models): Jittor Image Models is a library for pulling together a wide variety of SOTA deep learning models in the Jittor framework. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/white.png)
- [LearnJittorBasicIn60Min](https://github.com/Jittor/LearnJittorBasicIn60Min): A quick-start Jittor tutorial for beginners (60 minutes). [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/white.png)
- [di-fusion-network-jittor](https://github.com/heiwang1997/di-fusion-network-jittor): Jittor implementation of the network architecture in DI-Fusion: Online Implicit 3D Reconstruction with Deep Priors. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/white.png)
- [Zhusuan](https://github.com/McGrady00H/Zhusuan-Jittor): Zhusuan with a Jittor backend. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/white.png)
- [PRS-Net](https://github.com/IGLICT/PRS-NET-Jittor): This repository is the code release for PRS-Net: Planar Reflective Symmetry Detection Net for 3D Models. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/PRS-Net.png)
- [PFSegNets](https://github.com/Jittor/PFSegNets-Jittor): This repo contains the Jittor implementation of the CVPR 2021 work PointFlow: Flowing Semantics Through Points for Aerial Image Segmentation. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/PFSegNets.jpg)
- [APDrawingGAN](https://github.com/yiranran/APDrawingGAN-Jittor): We provide Jittor implementations for our CVPR 2019 oral paper "APDrawingGAN: Generating Artistic Portrait Drawings from Face Photos with Hierarchical GANs". [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/APDrawingGAN.png)
- [APDrawingGAN2](https://github.com/yiranran/APDrawingGAN2-Jittor): We provide Jittor implementations for our TPAMI 2020 paper "Line Drawings for Face Portraits from Photos using Global and Local Structure based GANs". [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/APDrawingGAN2.png)
- [CMIC-Retrieval](https://github.com/IGLICT/IBSR_jittor): Code for Single Image 3D Shape Retrieval via Cross-Modal Instance and Category Contrastive Learning, ICCV 2021. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/CMIC-Retrieval.png)
- [Unpaired-Portrait-Drawing](https://github.com/yiranran/Unpaired-Portrait-Drawing-Jittor): We provide Jittor implementations for our CVPR 2020 paper "Unpaired Portrait Drawing Generation via Asymmetric Cycle Mapping". [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/Unpaired-Portrait-Drawing.jpg)
- [Jittor-MLP](https://github.com/liuruiyang98/Jittor-MLP): Unofficial implementation of MLP-Mixer, gMLP, resMLP, Vision Permutator, S2MLPv2, ConvMLP, and ConvMixer in Jittor. [Thumbnail](https://cg.cs.tsinghua.edu.cn/jittor/assets/images/white.png)

View File

@ -17,6 +17,8 @@ Jittor's front-end language is Python. The front end uses a modular design with dynamic graph execution
* [Jittor Models](https://cg.cs.tsinghua.edu.cn/jittor/resources/)
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor) [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* IM: QQ Group(761222083)

View File

@ -17,6 +17,9 @@ Related Links:
* [Jittor Models](https://cg.cs.tsinghua.edu.cn/jittor/resources/)
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* [Awesome Jittor List](https://github.com/Jittor/jittor/blob/master/AWESOME-JITTOR-LIST.md)
* IM: QQ Group(761222083)

View File

@ -21,6 +21,8 @@ Related Links:
* [Jittor Models](https://cg.cs.tsinghua.edu.cn/jittor/resources/)
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor), [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* IM: QQ Group(761222083)
Related links:
* [Jittor Website](https://cg.cs.tsinghua.edu.cn/jittor/)
@ -28,6 +30,8 @@ Related Links:
* [Jittor Models](https://cg.cs.tsinghua.edu.cn/jittor/resources/)
* [Jittor Documents](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor) [Gitee](https://gitee.com/jittor/jittor)
* [Jittor Forum](https://discuss.jittor.org/)
* IM: QQ Group(761222083)
The following example shows how to model a two-layer neural network step by step and train it from scratch in a few lines of Python code.
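The example itself lies outside this diff hunk; the snippet below is a sketch in the same spirit (a `Module` subclass with an `execute` method, trained with `nn.SGD`). The class name, layer sizes, and training data are illustrative only, not taken from the README:
```python
import numpy as np
import jittor as jt
from jittor import nn, Module

class TwoLayerNet(Module):
    def __init__(self):
        self.layer1 = nn.Linear(1, 10)
        self.relu = nn.Relu()
        self.layer2 = nn.Linear(10, 1)
    def execute(self, x):
        return self.layer2(self.relu(self.layer1(x)))

model = TwoLayerNet()
optimizer = nn.SGD(model.parameters(), 0.05)
for i in range(100):
    x = jt.float32(np.random.rand(50, 1))   # random inputs
    y = x * x                                # toy regression target
    loss = ((model(x) - y) ** 2).mean()
    optimizer.step(loss)                     # Jittor optimizers take the loss directly
print(loss)
```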

View File

@ -0,0 +1,176 @@
Jittor Performance Benchmarking and Comparison
=====================
The following code uses ResNet-50 as an example to demonstrate the correct way to benchmark Jittor:
```python
import time
import jittor as jt
from jittor.models import resnet50
jt.flags.use_cuda = jt.has_cuda

warmup = 10
rerun = 100
batch_size = 8
data = jt.random((batch_size, 3, 224, 224))
model = resnet50()
model.eval()

# Warm up Jittor so that the timing below is accurate
jt.sync_all(True)
for i in range(warmup):
    pred = model(data)
    # sync dispatches the computation graph to the compute device
    pred.sync()
# sync_all(True) dispatches the computation graph to the compute device and synchronizes.
# The measured time is only valid after jt.sync_all(True) has run, so call it both before and after the timed forward passes.
jt.sync_all(True)

# Start timing
start = time.time()
for i in range(rerun):
    pred = model(data)
    pred.sync()
jt.sync_all(True)
end = time.time()
print("Jittor FPS:", (rerun*batch_size)/(end-start))
```
In this code we define three parameters, `batch_size`, `warmup`, and `rerun`: `batch_size` is the batch size, `warmup` is the number of warm-up iterations, and `rerun` is the number of timed iterations; the final output is the FPS. The keys to benchmarking Jittor correctly are the warm-up part and the synchronization part: warm-up keeps the measured time stable and excludes compilation time, while synchronization guarantees that the computation has actually finished, because Jittor is an asynchronous framework and only a synchronization operation guarantees completion.
The output of the code above is as follows (RTX Titan, batch size 8):
```
Compiling Operators(8/8) used: 7.35s eta: 0s
Compiling Operators(13/13) used: 8.36s eta: 0s
Jittor FPS: 908.9853866375396
```
We can benchmark PyTorch's performance with similar code:
```python
import time
import torch
from torchvision.models import resnet50

warmup = 10
rerun = 100
batch_size = 8
data = torch.randn((batch_size, 3, 224, 224)).cuda()
model = resnet50()
model.cuda()
model.eval()

# Warm up PyTorch so that the timing below is accurate
torch.cuda.synchronize()
for i in range(warmup):
    pred = model(data)
# synchronize makes sure the PyTorch computation has finished
torch.cuda.synchronize()

# Start timing
start = time.time()
for i in range(rerun):
    pred = model(data)
torch.cuda.synchronize()
end = time.time()
print("PyTorch FPS:", (rerun*batch_size)/(end-start))
```
The output of the code above is as follows (RTX Titan, batch size 8):
```
PyTorch FPS: 807.4806873965665
```
We can also merge the two snippets and check that their results are consistent:
```python
import time
import jittor as jt
from jittor.models import resnet50
jt.flags.use_cuda = jt.has_cuda

warmup = 100
rerun = 1000
batch_size = 8
data = jt.random((batch_size, 3, 224, 224))
model = resnet50()
model.eval()

# Warm up Jittor so that the timing below is accurate
jt.sync_all(True)
for i in range(warmup):
    pred = model(data)
    # sync dispatches the computation graph to the compute device
    pred.sync()
# sync_all(True) dispatches the computation graph to the compute device and synchronizes.
# The measured time is only valid after jt.sync_all(True) has run, so call it both before and after the timed forward passes.
jt.sync_all(True)

# Start timing
start = time.time()
for i in range(rerun):
    pred = model(data)
    pred.sync()
jt.sync_all(True)
end = time.time()
print("Jittor FPS:", (rerun*batch_size)/(end-start))

# Export the Jittor output and parameters to numpy / torch format
jittor_data = pred.numpy()
jittor_param = model.state_dict(to="torch")

import numpy as np
import torch
from torchvision.models import resnet50

data = torch.Tensor(data.numpy()).cuda()
model = resnet50()
# Load the Jittor parameters
model.load_state_dict(jittor_param)
model.cuda()
model.eval()

# Warm up PyTorch so that the timing below is accurate
torch.cuda.synchronize()
for i in range(warmup):
    pred = model(data)
# synchronize makes sure the PyTorch computation has finished
torch.cuda.synchronize()

# Start timing
start = time.time()
for i in range(rerun):
    pred = model(data)
torch.cuda.synchronize()
end = time.time()
print("PyTorch FPS:", (rerun*batch_size)/(end-start))

pytorch_data = pred.detach().cpu().numpy()
err = np.mean(np.abs(pytorch_data - jittor_data))
print("mean error:", err)
```
The output of this code is as follows:
```
Jittor FPS: 908.9853866375396
PyTorch FPS: 807.4806873965665
mean error: 1e-5
```
The reported error is 1e-5, which is within the acceptable range. The key points for correct benchmarking and comparison are:
1. Warm up sufficiently, so that the framework's setup time is excluded.
2. Run many iterations, so that the measured time is stable.
3. Add synchronization statements, so that the measured time is accurate.
4. Make sure GPU memory is sufficient: when it runs out, Jittor falls back to unified memory, which costs performance, so keep a close eye on the output of `nvidia-smi`.
5. Make sure the compared models are identical, and check that their outputs agree.
If you have questions about the benchmark results or need further optimization, feel free to contact the Jittor development team. A compact helper that follows the recipe above is sketched below.
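The following is a minimal sketch of how the warm-up and synchronization recipe can be packaged into a reusable function. The helper name `benchmark_jittor` and its defaults are illustrative only (not part of Jittor); it relies solely on the `jt.sync_all(True)` and `pred.sync()` calls already used in the snippets above.
```python
import time
import jittor as jt

def benchmark_jittor(model, data, warmup=10, rerun=100):
    """Return throughput in FPS, following the warm-up + sync recipe above.
    (Hypothetical helper; name and signature are illustrative.)"""
    model.eval()
    jt.sync_all(True)            # finish any pending work before warm-up
    for _ in range(warmup):
        model(data).sync()       # dispatch the graph to the device
    jt.sync_all(True)            # warm-up done: graph is compiled and executed
    start = time.time()
    for _ in range(rerun):
        model(data).sync()
    jt.sync_all(True)            # make sure all timed work has finished
    return rerun * data.shape[0] / (time.time() - start)
```
With this helper, the first measurement above corresponds roughly to `benchmark_jittor(resnet50(), jt.random((8, 3, 224, 224)))`.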

View File

@ -74,3 +74,17 @@ export gdb_attach=1
Here, the environment variable `debug=1` enables Jittor's debug mode: performance drops significantly, but debugging information is preserved. `gdb_attach=1` automatically attaches gdb to Jittor's main process so that you can step through it. For gdb usage, see the [GDB Cheat Sheet](https://darkdust.net/files/GDB%20Cheat%20Sheet.pdf).
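As a concrete illustration, the two switches can also be set from inside a Python script, provided this happens before `jittor` is imported. This is a minimal sketch; the toy computation at the end is only there to trigger compilation:
```python
import os
# Both variables must be set before jittor is imported, since they are read
# when the framework compiles itself and starts its main process.
os.environ["debug"] = "1"        # enable Jittor debug mode (slower, keeps debug info)
os.environ["gdb_attach"] = "1"   # automatically attach gdb to Jittor's main process

import jittor as jt
print(jt.zeros((2, 2)))          # any small computation exercises the debug build
```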
## Managing the Jittor cache
Jittor creates a cache under the `~/.cache/jittor` directory. The cache may contain the core (kernels), the CUDA compiler, CUDA libraries, datasets, pre-trained parameters, and so on. In some situations (for example after a system or driver update) the cache can become invalid, and you may need to clear it manually as follows:
```
python3 -m jittor_utils.clean_cache all
```
The command above clears all of Jittor's caches. If you do not want to clear everything, see the command-line help:
```
python3 -m jittor_utils.clean_cache help
```

View File

@ -48,6 +48,7 @@
:caption: Others:
Jittor调试技巧
Jittor性能测试与对比方法
Tutorials <https://cg.cs.tsinghua.edu.cn/jittor/tutorial/>
Indices and tables

View File

@ -9,7 +9,7 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
__version__ = '1.3.0.7'
__version__ = '1.3.1.16'
from jittor_utils import lock
with lock.lock_scope():
ori_int = int
@ -26,8 +26,7 @@ with lock.lock_scope():
from .compile_extern import mkl_ops, mpi, mpi_ops, in_mpi, rank, world_size
if core.get_device_count() == 0:
has_cuda = compile_extern.has_cuda = compiler.has_cuda = False
if has_cuda:
from .compile_extern import cudnn, curand, cublas
from .compile_extern import cudnn, curand, cublas
from .init_cupy import numpy2cupy
import contextlib
@ -53,6 +52,30 @@ def safepickle(obj, path):
with open(path, 'wb') as f:
f.write(s)
def _load_pkl(s, path):
try:
return pickle.loads(s)
except Exception as e:
msg = str(e)
msg += f"\nPath: \"{path}\""
if "trunc" in msg:
msg += "\nThis file maybe corrupted, please consider remove it" \
" and re-download."
raise RuntimeError(msg)
def _upload(path, url, jk):
prefix = "https://cg.cs.tsinghua.edu.cn/jittor/assets"
if url.startswith("jittorhub://"):
url = url.replace("jittorhub://", prefix+"/build/checkpoints/")
assert url.startswith(prefix)
suffix = url[len(prefix):]
jkey = flags.cache_path+"/_jkey"
with open(jkey, 'w') as f:
f.write(jk)
assert os.system(f"s""c""p"+f" -i \"{jkey}\" \"{path}\" jittor" "@" "166" f".111.68.30:Documents/jittor-blog/assets{suffix}") == 0
assert os.system(f"s""s""h"f" -i \"{jkey}\" jittor" "@" "166" ".111.68.30 Documents/jittor-blog.git/hooks/post-update") == 0
def safeunpickle(path):
if path.startswith("jittorhub://"):
path = path.replace("jittorhub://", "https://cg.cs.tsinghua.edu.cn/jittor/assets/build/checkpoints/")
@ -82,12 +105,14 @@ def safeunpickle(path):
with open(path, "rb") as f:
s = f.read()
if not s.endswith(b"HCAJSLHD"):
return pickle.loads(s)
return _load_pkl(s, path)
checksum = s[-28:-8]
s = s[:-28]
if hashlib.sha1(s).digest() != checksum:
raise ValueError("Pickle checksum does not match! path: "+path)
return pickle.loads(s)
raise ValueError("Pickle checksum does not match! path: "+path,
" This file maybe corrupted, please consider remove it"
" and re-download.")
return _load_pkl(s, path)
class _call_no_record_scope:
def __enter__(self): pass
@ -329,7 +354,7 @@ def full_like(x,val):
def zeros_like(x):
return zeros(x.shape,x.dtype)
flags = core.flags()
flags = core.Flags()
def std(x):
matsize=1
@ -361,6 +386,11 @@ origin_transpose = transpose
def transpose(x, *dim):
if len(dim) == 1 and isinstance(dim[0], (Sequence, NanoVector)):
dim = dim[0]
elif len(dim) == 2:
axes = list(range(x.ndim))
a, b = dim
axes[a], axes[b] = axes[b], axes[a]
dim = axes
return origin_transpose(x, dim)
transpose.__doc__ = origin_transpose.__doc__
Var.transpose = Var.permute = permute = transpose
@ -804,7 +834,35 @@ class Module:
self.dfs([], None, callback, callback_leave)
return _uniq(ps)
def state_dict(self):
def state_dict(self, to=None):
''' Returns a dictionary containing
the Jittor Vars of the module and its descendants.
Args:
to: target type of var, can be None, 'numpy' or 'torch'
Return:
dictionary of module's states.
Example::
import jittor as jt
from jittor.models import resnet50
jittor_model = resnet50()
dict = jittor_model.state_dict()
jittor_model.load_state_dict(dict)
Example 2 (export Jittor params to PyTorch)::
import jittor as jt
from jittor.models import resnet50
jittor_model = resnet50()
import torch
from torchvision.models import resnet50
torch_model = resnet50()
torch_model.load_state_dict(jittor_model.state_dict(to="torch"))
'''
uniq_set = set()
ps = {}
stack = []
@ -825,6 +883,15 @@ class Module:
def callback_leave(parents, k, v, n):
stack.pop()
self.dfs([], None, callback, callback_leave)
if to == "numpy":
for k,v in ps.items():
if isinstance(v, Var):
ps[k] = v.numpy()
elif to == "torch":
import torch
for k,v in ps.items():
if isinstance(v, Var):
ps[k] = torch.Tensor(v.numpy())
return ps
def named_parameters(self):
@ -1401,3 +1468,4 @@ from .misc import *
from . import sparse
from . import optim
from . import dataset
from . import init

python/jittor/__init__.pyi (new file, 7418 lines; diff suppressed because it is too large)

View File

@ -219,8 +219,8 @@ def setup_cuda_extern():
msg += """Develop version of CUDNN not found,
please refer to CUDA official tar file installation:
https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-tar"""
if platform.machine() in ["x86_64", "AMD64"]:
msg += f"""
if platform.machine() in ["x86_64", "AMD64"]:
msg += f"""
or you can let jittor install cuda and cudnn for you:
>>> python3.{sys.version_info.minor} -m jittor_utils.install_cuda
"""
@ -309,7 +309,7 @@ def install_cutt(root_folder):
if md5 != true_md5:
os.remove(fullname)
shutil.rmtree(dirname)
if not os.path.isfile(os.path.join(dirname, "lib/libcutt"+so)):
if not os.path.isfile(os.path.join(cache_path, "libcutt"+so)):
LOG.i("Downloading cutt...")
download_url_to_local(url, filename, root_folder, true_md5)
@ -337,8 +337,7 @@ def install_cutt(root_folder):
continue
files2.append(f)
cutt_flags = cc_flags+opt_flags+cutt_include
os.makedirs(dirname+"/lib", exist_ok=True)
compile(cc_path, cutt_flags, files2, dirname+"/lib/libcutt"+so, cuda_flags=arch_flag)
compile(cc_path, cutt_flags, files2, cache_path+"/libcutt"+so, cuda_flags=arch_flag)
return dirname
def setup_cutt():
@ -362,7 +361,7 @@ def setup_cutt():
install_cutt(cutt_path)
cutt_home = os.path.join(cutt_path, "cutt-1.2")
cutt_include_path = os.path.join(cutt_home, "src")
cutt_lib_path = os.path.join(cutt_home, "lib")
cutt_lib_path = cache_path
cutt_lib_name = os.path.join(cutt_lib_path, "libcutt"+so)
assert os.path.isdir(cutt_include_path)
@ -397,7 +396,8 @@ def install_nccl(root_folder):
if os.path.isdir(dirname):
shutil.rmtree(dirname)
if not os.path.isfile(os.path.join(dirname, "build", "lib", "libnccl.so")):
LOG.i("Downloading nccl...")
if not os.path.isfile(os.path.join(root_folder, filename)):
LOG.i("Downloading nccl...")
download_url_to_local(url, filename, root_folder, true_md5)
if core.get_device_count() == 0:
@ -531,7 +531,7 @@ def setup_mpi():
mpi_ops = mpi.ops
LOG.vv("Get mpi: "+str(mpi.__dict__.keys()))
LOG.vv("Get mpi_ops: "+str(mpi_ops.__dict__.keys()))
def warper(func):
def wrapper(func):
def inner(self, *args, **kw):
return func(self, *args, **kw)
inner.__doc__ = func.__doc__
@ -539,7 +539,7 @@ def setup_mpi():
for k in mpi_ops.__dict__:
if not k.startswith("mpi_"): continue
if k == "mpi_test": continue
setattr(core.Var, k, warper(mpi_ops.__dict__[k]))
setattr(core.Var, k, wrapper(mpi_ops.__dict__[k]))
if os.environ.get("FIX_TORCH_ERROR", "0") == "1":
try:
@ -547,6 +547,7 @@ if os.environ.get("FIX_TORCH_ERROR", "0") == "1":
except:
pass
cudnn = cublas = curand = None
setup_mpi()
in_mpi = inside_mpi()
rank = mpi.world_rank() if in_mpi else 0

View File

@ -116,9 +116,11 @@ def compile(compiler, flags, inputs, output, combind_build=False, cuda_flags="")
obj_files.append(os.path.join(
cache_path, "obj_files", os.path.basename(name)+".o"))
inputs = new_inputs
cm = lambda s: f"\"{s}\""
cms = lambda arr: [f"\"{s}\"" for s in arr ]
if len(inputs) == 1 or combind_build:
cmd = f"\"{compiler}\" {' '.join(inputs)} {flags} -o {output}"
cmd = f"\"{compiler}\" {' '.join(cms(inputs))} {flags} -o {cm(output)}"
return do_compile(fix_cl_flags(cmd))
# split compile object file and link
# remove -l -L flags when compile object files
@ -127,7 +129,7 @@ def compile(compiler, flags, inputs, output, combind_build=False, cuda_flags="")
for input, obj_file in zip(inputs, obj_files):
cc = compiler
nflags = oflags
cmd = f"{input} {nflags} {lto_flags} -c -o {obj_file}"
cmd = f"{cm(input)} {nflags} {lto_flags} -c -o {cm(obj_file)}"
if input.endswith(".cu"):
if has_cuda:
cmd = f"\"{nvcc_path}\" {cuda_flags} {cmd}"
@ -146,9 +148,9 @@ def compile(compiler, flags, inputs, output, combind_build=False, cuda_flags="")
obj_files += ex_obj_files
if os.name == 'nt':
dumpdef_path = os.path.join(jittor_path, "utils", "dumpdef.py")
cmd = f"\"{sys.executable}\" \"{dumpdef_path}\" {' '.join(obj_files)} -Fo: \"{output}.def\""
cmd = f"\"{sys.executable}\" \"{dumpdef_path}\" {' '.join(cms(obj_files))} -Fo: \"{output}.def\""
do_compile(fix_cl_flags(cmd))
cmd = f"\"{compiler}\" {' '.join(obj_files)} -o {output} {flags} {lto_flags}"
cmd = f"\"{compiler}\" {' '.join(cms(obj_files))} -o {cm(output)} {flags} {lto_flags}"
return do_compile(fix_cl_flags(cmd))
def gen_jit_tests():
@ -245,7 +247,7 @@ def gen_jit_flags():
{jit_declares}
// @pyjt(flags)
// @pyjt(Flags)
struct _Flags {{
// @pyjt(__init__)
_Flags() {{}}
@ -695,8 +697,8 @@ def compile_custom_ops(
gen_name = "gen_ops_" + "_".join(headers.keys())
if gen_name_ != "":
gen_name = gen_name_
if len(gen_name) > 100:
gen_name = gen_name[:80] + "___hash" + hashlib.md5(gen_name.encode()).hexdigest()
if len(gen_name) > 50:
gen_name = gen_name[:50] + "___hash" + hashlib.md5(gen_name.encode()).hexdigest()[:6]
includes = sorted(list(set(includes)))
includes = "".join(map(lambda x: f" -I\"{x}\" ", includes))
@ -722,7 +724,7 @@ def compile_custom_ops(
return gen_src.replace(anchor_str, anchor_str+insert_str, 1)
for name in pyjt_includes:
LOG.i("handle pyjt_include", name)
LOG.v("handle pyjt_include ", name)
bname = os.path.basename(name).split(".")[0]
gen_src_fname = os.path.join(cache_path, "custom_ops", gen_name+"_"+bname+".cc")
pyjt_compiler.compile_single(name, gen_src_fname)
@ -868,16 +870,16 @@ def check_cache_compile():
files = [ x.replace('/', '\\') for x in files ]
global jit_utils_core_files
jit_utils_core_files = files
recompile = compile(cc_path, cc_flags+f" {opt_flags} ", files, 'jit_utils_core'+extension_suffix, True)
recompile = compile(cc_path, cc_flags+f" {opt_flags} ", files, jit_utils.cache_path+'/jit_utils_core'+extension_suffix, True)
if recompile and jit_utils.cc:
LOG.e("jit_utils updated, please restart jittor.")
LOG.e("jit_utils updated, please rerun your command.")
sys.exit(0)
if not jit_utils.cc:
with jit_utils.import_scope(import_flags):
jit_utils.try_import_jit_utils_core()
assert jit_utils.cc
# recompile, generate cache key
compile(cc_path, cc_flags+f" {opt_flags} ", files, 'jit_utils_core'+extension_suffix, True)
compile(cc_path, cc_flags+f" {opt_flags} ", files, jit_utils.cache_path+'/jit_utils_core'+extension_suffix, True)
def env_or_try_find(name, bname):
if name in os.environ:
@ -973,6 +975,25 @@ gdb_path = env_or_try_find('gdb_path', 'gdb')
addr2line_path = try_find_exe('addr2line')
has_pybt = check_pybt(gdb_path, python_path)
if nvcc_path:
# gen cuda key for cache_path
cu = "cu"
v = jit_utils.get_version(nvcc_path)[1:-1]
nvcc_version = list(map(int,v.split('.')))
cu += v
try:
r, s = sp.getstatusoutput(f"{sys.executable} -m jittor_utils.query_cuda_cc")
if r==0:
s = sorted(list(set(s.strip().split())))
cu += "_sm_" + "_".join(s)
if "cuda_arch" not in os.environ:
os.environ["cuda_arch"] = " ".join(cu)
except:
pass
LOG.i("cuda key:", cu)
cache_path = os.path.join(cache_path, cu)
sys.path.append(cache_path)
def check_clang_latest_supported_cpu():
output = run_cmd('clang --print-supported-cpus')
@ -1005,10 +1026,10 @@ if platform.system() == 'Darwin':
opt_flags = ""
py_include = jit_utils.get_py3_include_path()
LOG.i(f"py_include: {py_include}")
LOG.v(f"py_include: {py_include}")
extension_suffix = jit_utils.get_py3_extension_suffix()
lib_suffix = extension_suffix.rsplit(".", 1)[0]
LOG.i(f"extension_suffix: {extension_suffix}")
LOG.v(f"extension_suffix: {extension_suffix}")
so = ".so" if os.name != 'nt' else ".dll"
@ -1064,7 +1085,7 @@ if os.name == 'nt':
win_libpaths = {}
def fix_cl_flags(cmd):
cmd = cmd.replace(".o ", ".obj ")
cmd = cmd.replace(".o\" ", ".obj\" ")
cmd = cmd.replace(".o\"", ".obj\"")
if cmd.endswith(".o"): cmd += "bj"
if " -o " in cmd:
if " -shared " in cmd:
@ -1164,7 +1185,10 @@ if has_cuda:
return x
return f"-L\"{a}\" -l{b[:-4]}"
nvcc_flags = map_flags(nvcc_flags, func)
nvcc_flags = nvcc_flags.replace("-std=c++17", "-std=c++14 -Xcompiler -std:c++14")
if nvcc_version >= [11,4]:
nvcc_flags = nvcc_flags.replace("-std=c++17", "-std=c++14 -Xcompiler -std:c++14")
else:
nvcc_flags = nvcc_flags.replace("-std=c++17", "")
nvcc_flags = nvcc_flags.replace("-Wall", "")
nvcc_flags = nvcc_flags.replace("-Wno-unknown-pragmas", "")
nvcc_flags = nvcc_flags.replace("-fopenmp", "")
@ -1189,7 +1213,7 @@ jit_src = gen_jit_op_maker(op_headers)
LOG.vvvv(jit_src)
with open(os.path.join(cache_path, "gen", "jit_op_maker.h"), 'w') as f:
f.write(jit_src)
cc_flags += f' -I\"{cache_path}\" -L\"{cache_path}\" '
cc_flags += f' -I\"{cache_path}\" -L\"{cache_path}\" -L\"{jit_utils.cache_path}\" '
# gen pyjt
pyjt_gen_src = pyjt_compiler.compile(cache_path, jittor_path)
@ -1264,7 +1288,7 @@ if use_data_gz:
.replace("-Werror", "") \
.replace("-shared", "")
vdp = os.path.join(jittor_path, "src", "utils", "vdp")
run_cmd(fix_cl_flags(f"{cc_path} {dflags} -include {vdp} {data_s_path} -c -o {data_o_path}"))
run_cmd(fix_cl_flags(f"{cc_path} {dflags} -include \"{vdp}\" \"{data_s_path}\" -c -o \"{data_o_path}\""))
os.remove(data_s_path)
with open(data_gz_md5_path, 'w') as f:
f.write(md5)
@ -1281,15 +1305,15 @@ cc_flags += f" -l\"jittor_core{lib_suffix}\" "
with jit_utils.import_scope(import_flags):
import jittor_core as core
flags = core.flags()
flags = core.Flags()
if has_cuda:
nvcc_flags = convert_nvcc_flags(cc_flags)
nvcc_version = jit_utils.get_int_version(nvcc_path)
nvcc_version = list(jit_utils.get_int_version(nvcc_path))
max_arch = 1000
if nvcc_version < (11,):
if nvcc_version < [11,]:
max_arch = 75
elif nvcc_version < (11,1):
elif nvcc_version < [11,1]:
max_arch = 80
if len(flags.cuda_archs):
min_arch = 30

View File

@ -13,7 +13,7 @@
#include "var.h"
#include "cublas_batched_matmul_op.h"
#include "cublas_warper.h"
#include "cublas_wrapper.h"
using namespace std;

View File

@ -10,7 +10,7 @@
#include "var.h"
#include "cublas_matmul_op.h"
#include "cublas_warper.h"
#include "cublas_wrapper.h"
using namespace std;

View File

@ -7,7 +7,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "cublas_warper.h"
#include "cublas_wrapper.h"
#include "misc/cuda_flags.h"
namespace jittor {

View File

@ -7,7 +7,7 @@
// ***************************************************************
#pragma once
#include "op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
#include "init.h"

View File

@ -10,7 +10,7 @@
#include "mem/allocator.h"
#include "var.h"
#include "cudnn_conv3d_backward_w_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
#include "ops/op_register.h"

View File

@ -10,7 +10,7 @@
#include "mem/allocator.h"
#include "var.h"
#include "cudnn_conv3d_backward_x_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
#include "ops/op_register.h"

View File

@ -7,7 +7,7 @@
// ***************************************************************
#include "var.h"
#include "cudnn_conv3d_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
#include "ops/op_register.h"

View File

@ -10,7 +10,7 @@
#include "mem/allocator.h"
#include "var.h"
#include "cudnn_conv_backward_w_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
using namespace std;

View File

@ -10,7 +10,7 @@
#include "mem/allocator.h"
#include "var.h"
#include "cudnn_conv_backward_x_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
using namespace std;

View File

@ -7,7 +7,7 @@
// ***************************************************************
#include "var.h"
#include "cudnn_conv_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
using namespace std;

View File

@ -8,7 +8,7 @@
#include "var.h"
#include "cudnn_rnn_descriptor.h"
#include "cudnn_rnn_backward_x_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
#include "ops/op_register.h"

View File

@ -8,7 +8,7 @@
#include "var.h"
#include "cudnn_rnn_descriptor.h"
#include "cudnn_rnn_op.h"
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "executor.h"
#include "ops/op_register.h"

View File

@ -4,7 +4,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "cudnn_warper.h"
#include "cudnn_wrapper.h"
#include "misc/cuda_flags.h"
namespace jittor {

View File

@ -12,7 +12,7 @@
#include <curand.h>
#include "helper_cuda.h"
#include "curand_random_op.h"
#include "curand_warper.h"
#include "curand_wrapper.h"
namespace jittor {

View File

@ -7,7 +7,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "curand_warper.h"
#include "curand_wrapper.h"
#include "init.h"
#include "misc/cuda_flags.h"

View File

@ -7,7 +7,7 @@
#include "cutt_transpose_op.h"
#include "ops/op_register.h"
#include "cutt.h"
#include "cutt_warper.h"
#include "cutt_wrapper.h"
#include "misc/stack_vector.h"
#include "helper_cuda.h"

View File

@ -6,7 +6,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "cutt_warper.h"
#include "cutt_wrapper.h"
namespace jittor {

View File

@ -8,7 +8,7 @@
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#include "mpi_warper.h"
#include "mpi_wrapper.h"
#include <cuda_runtime.h>
#include <nccl.h>

View File

@ -13,7 +13,7 @@
#include <nccl.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"
#include "nccl_warper.h"
#include "nccl_wrapper.h"
#include "ops/op_register.h"
namespace jittor {

View File

@ -13,7 +13,7 @@
#include <nccl.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"
#include "nccl_warper.h"
#include "nccl_wrapper.h"
#include "ops/op_register.h"
namespace jittor {

View File

@ -13,7 +13,7 @@
#include <nccl.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"
#include "nccl_warper.h"
#include "nccl_wrapper.h"
#include "ops/op_register.h"
namespace jittor {

View File

@ -7,7 +7,7 @@
#include "nccl_test_op.h"
#include "utils/str_utils.h"
#include "nccl_warper.h"
#include "nccl_wrapper.h"
namespace jittor {

View File

@ -8,7 +8,7 @@
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "misc/cuda_flags.h"
#include "nccl_warper.h"
#include "nccl_wrapper.h"
#include "event_queue.h"
const char *_cudaGetErrorEnum(ncclResult_t error) {

View File

@ -6,7 +6,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "mpi_warper.h"
#include "mpi_wrapper.h"
#include "var.h"
#include "mpi_all_reduce_op.h"
#include "ops/op_register.h"

View File

@ -6,7 +6,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "mpi_warper.h"
#include "mpi_wrapper.h"
#include "var.h"
#include "mpi_broadcast_op.h"
#include "ops/op_register.h"

View File

@ -6,7 +6,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "mpi_warper.h"
#include "mpi_wrapper.h"
#include "var.h"
#include "mpi_reduce_op.h"
#include "ops/op_register.h"

View File

@ -3,7 +3,7 @@
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include "mpi_warper.h"
#include "mpi_wrapper.h"
#include "var.h"
#include "mpi_test_op.h"

View File

@ -11,7 +11,7 @@
#include <stdint.h>
#include <stdio.h>
#include "mpi_warper.h"
#include "mpi_wrapper.h"
#include "common.h"
#include "ops/array_op.h"

View File

@ -8,36 +8,324 @@
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
import jittor as jt
from jittor import NanoVector, Var
import numpy as np
import math
def eye(shape, dtype):
return jt.array(np.identity(shape[0])).unary(dtype)
def eye(shape, dtype="float32"):
''' Generate 2-D identity matrix.
Args:
shape (int or tuple of int):
shape of the output matrix
dtype (string):
dtype of the output matrix, default float32
Return:
A Jittor Var of identity matrix.
Example::
from jittor import init
print(init.eye(2))
# output: [[1.,0.],[0.,1.]]
print(init.eye((2,3), "float32"))
# output: [[1.,0.,0.],[0.,1.,0.]]
'''
if isinstance(shape, int):
shape = (shape,shape)
assert len(shape)==2, f"len of shape should be 2, but got {shape}"
index = jt.index(shape)
return (index[0]==index[1]).unary(dtype)
def eye_(var):
return var.assign(eye(var.shape, var.dtype))
''' Inplace initialize variable with identity matrix.
def constant(shape, dtype, value=0.0):
return jt.array(value).unary(dtype).broadcast(shape)
Args:
var (Jittor Var):
Var to initialize with identity matrix.
Return:
var itself.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.eye_(linear.weight)
print(linear.weight)
# output: [[1.,0.],[0.,1.]]
linear.weight.eye_() # This is ok too
'''
return var.assign(eye(var.shape, var.dtype))
Var.eye_ = eye_
def constant(shape, dtype="float32", value=0.0):
'''Generate constant Jittor Var.
Args:
shape (int or tuple of int):
shape of the output Var
dtype (string):
dtype of the output Var, default float32
value (int or float):
value to be filled in output Var
Return:
A Jittor Var filled with the constant value.
Example::
from jittor import init
print(init.constant(2))
# output: [0.,0.]
print(init.constant((2,3), value=1.))
# output: [[1.,1.,1.],[1.,1.,1.]]
'''
return jt.array(value).unary(dtype).broadcast(NanoVector(shape))
def constant_(var, value=0.0):
''' Inplace initialize variable with constant value.
Args:
var (Jittor Var):
Var to initialize with constant value.
Return:
var itself.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.constant_(linear.weight)
print(linear.weight)
# output: [[0.,0.],[0.,0.]]
linear.weight.constant_() # This is ok too
'''
return var.assign(constant(var.shape, var.dtype, value))
Var.constant_ = constant_
def uniform(shape, dtype, low, high):
return jt.random(shape, dtype) * (low - high) + high
def zero(shape, dtype="float32"):
'''Generate zero Jittor Var.
def uniform_(var, low, high):
Args:
shape (int or tuple of int):
shape of the output Var
dtype (string):
dtype of the output Var, default float32
Return:
A Jittor Var filled with zeros.
Example::
from jittor import init
print(init.zero(2))
# output: [0.,0.]
print(init.zero((2,3)))
# output: [[0.,0.,0.],[0.,0.,0.]]
'''
return constant(shape, dtype, 0)
def zero_(var):
''' Inplace initialize variable with zero.
Args:
var (Jittor Var):
Var to initialize with zero.
Return:
var itself.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.zero_(linear.weight)
print(linear.weight)
# output: [[0.,0.],[0.,0.]]
linear.weight.zero_() # This is ok too
'''
return var.assign(zero(var.shape, var.dtype))
Var.zero_ = zero_
def one(shape, dtype="float32"):
'''Generate Jittor Var filled by one.
Args:
shape (int or tuple of int):
shape of the output Var
dtype (string):
dtype of the output Var, default float32
Return:
A Jittor Var filled with ones.
Example::
from jittor import init
print(init.one(2))
# output: [1.,1.]
print(init.one((2,3)))
# output: [[1.,1.,1.],[1.,1.,1.]]
'''
return constant(shape, dtype, 1)
def one_(var):
''' Inplace initialize variable with one.
Args:
var (Jittor Var):
Var to initialize with one.
Return:
var itself.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.one_(linear.weight)
print(linear.weight)
# output: [[1.,1.],[1.,1.]]
linear.weight.one_() # This is ok too
'''
return var.assign(one(var.shape, var.dtype))
Var.one_ = one_
def uniform(shape, dtype="float32", low=0, high=1):
'''Generate random uniform Jittor Var.
Args:
shape (int or tuple of int):
shape of the output Var
dtype (string):
dtype of the output Var, default float32
low (int or float or Var):
lower bound value of the random uniform
high (int or float or Var):
upper bound value of the random uniform
Return:
A Jittor Var filled with values drawn from the random uniform distribution.
Example::
from jittor import init
print(init.uniform(5))
# output: [0.202268, 0.518688, 0.595274, 0.777354, 0.981979]
print(init.uniform((2,3), low=-1, high=1))
# output: [[ 0.6647397 0.2801202 -0.01981187]
# [-0.9779438 -0.30149996 0.69056886]]
'''
return jt.random(NanoVector(shape), dtype) * (low - high) + high
def uniform_(var, low=0, high=1):
''' Inplace initialize Jittor Var by random uniform.
Args:
var (Jittor Var):
Var to be initialized by random uniform
low (int or float or Var):
lower bound value of the random uniform
high (int or float or Var):
upper bound value of the random uniform
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.uniform_(linear.weight, -1.0, 1.0)
print(linear.weight)
# output: [[ 0.6647397 0.2801202], [-0.9779438 -0.30149996]]
linear.weight.uniform_(-1.0, 1.0) # This is ok too
'''
return var.assign(uniform(var.shape, var.dtype, low, high))
Var.uniform_ = uniform_
def gauss(shape, dtype, mean=0.0, std=1.0):
return jt.random(shape, dtype, "normal") * std + mean
def gauss(shape, dtype="float32", mean=0.0, std=1.0):
''' Return a Jittor Var initialized by random gauss.
Args:
shape (int or tuple of int):
shape of the output Var
dtype (string):
dtype of the output Var, default float32
mean (int or float or Var):
mean value of the random gauss
std (int or float or Var):
std value of the random gauss
Example::
from jittor import init
from jittor import nn
a = init.gauss((2,2), "float32", 0.0, 1.0)
print(a)
'''
return jt.random(NanoVector(shape), dtype, "normal") * std + mean
def gauss_(var, mean=0.0, std=1.0):
return var.assign(gauss(var.shape, var.dtype, mean, std))
''' Inplace initialize Jittor Var by random gauss.
def invariant_uniform(shape, dtype, mode="fan_in"):
Args:
var (Jittor Var):
Var to be initialized by random gauss
mean (int or float or Var):
mean value of the random gauss
std (int or float or Var):
std value of the random gauss
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.gauss_(linear.weight, 0.0, 1.0)
print(linear.weight)
linear.weight.gauss_(0.0, 1.0) # This is ok too
'''
return var.assign(gauss(var.shape, var.dtype, mean, std))
Var.gauss_ = gauss_
def invariant_uniform(shape, dtype="float32", mode="fan_in"):
''' Return a Jittor Var initialized by invariant_uniform.
Args:
shape (int or tuple of int):
shape of the output Var
dtype (string):
dtype of the output Var, default float32
mode (string):
mode selection, should be fan_in or fan_out.
Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
Example::
from jittor import init
from jittor import nn
a = init.invariant_uniform((2,2))
print(a)
'''
assert len(shape)>1
assert mode=="fan_in" or mode=="fan_out"
assert mode=="fan_in" or mode=="fan_out", \
f"mode not supported, should be fan_in or fan_out, but got {mode}"
matsize=1
for i in shape[2:]:
@ -47,9 +335,48 @@ def invariant_uniform(shape, dtype, mode="fan_in"):
return uniform(shape, dtype, -bound, bound)
def invariant_uniform_(var, mode="fan_in"):
var.assign(invariant_uniform(tuple(var.shape), var.dtype, mode))
''' Inplace initialize Jittor Var by invariant_uniform.
def relu_invariant_gauss(shape, dtype, mode="fan_in"):
Args:
var (Jittor Var):
Var to be initialized by random invariant_uniform
mode (string):
mode selection, should be fan_in or fan_out.
Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.invariant_uniform_(linear.weight)
print(linear.weight)
linear.weight.invariant_uniform_() # This is ok too
'''
var.assign(invariant_uniform(tuple(var.shape), var.dtype, mode))
Var.invariant_uniform_ = invariant_uniform_
def relu_invariant_gauss(shape, dtype="float32", mode="fan_in"):
''' Return Jittor Var initialized by relu_invariant_gauss.
Args:
shape (int or tuple of int):
shape of the output Var
dtype (string):
dtype of the output Var, default float32
mode (string):
mode selection, should be fan_in or fan_out.
Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
Example::
from jittor import init
from jittor import nn
a = init.relu_invariant_gauss((2,2))
print(a)
'''
assert len(shape)>1
assert mode=="fan_in" or mode=="fan_out"
@ -61,9 +388,29 @@ def relu_invariant_gauss(shape, dtype, mode="fan_in"):
return gauss(shape, dtype, 0, std)
def relu_invariant_gauss_(var, mode="fan_in"):
return var.assign(relu_invariant_gauss(tuple(var.shape), var.dtype, mode))
''' Inplace initialize Jittor Var by relu_invariant_gauss.
def calculate_std(var,mode,nonlinearity,param=0.01):
Args:
var (Jittor Var):
Var to be initialized by random relu_invariant_gauss
mode (string):
mode selection, should be fan_in or fan_out.
Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.relu_invariant_gauss_(linear.weight)
print(linear.weight)
linear.weight.relu_invariant_gauss_() # This is ok too
'''
return var.assign(relu_invariant_gauss(tuple(var.shape), var.dtype, mode))
Var.relu_invariant_gauss_ = relu_invariant_gauss_
def calculate_std(var, mode, nonlinearity, param=0.01):
mode = mode.lower()
assert isinstance(param,(int,float))
assert var.ndim>=2
@ -91,16 +438,90 @@ def calculate_std(var,mode,nonlinearity,param=0.01):
def kaiming_uniform_(var, a=0, mode='fan_in', nonlinearity='leaky_relu'):
''' Inplace initialize Jittor Var by kaiming_uniform.
Args:
var (Jittor Var):
Var to be initialized by random kaiming_uniform
a (float):
the negative slope of the rectifier used after this layer (only used with 'leaky_relu')
mode (string):
mode selection, should be fan_in or fan_out.
Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
nonlinearity (string):
nonlinearity used after this layer.
It can be one of [linear, conv*, sigmoid, tanh, relu, leaky_relu].
leaky_relu is used by default.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.kaiming_uniform_(linear.weight)
print(linear.weight)
linear.weight.kaiming_uniform_() # This is ok too
'''
std = calculate_std(var,mode,nonlinearity,a)
bound = math.sqrt(3.0) * std
return uniform_(var,-bound, bound)
Var.kaiming_uniform_ = kaiming_uniform_
def kaiming_normal_(var, a=0, mode='fan_in', nonlinearity='leaky_relu'):
''' Inplace initialize Jittor Var by kaiming_normal.
Args:
var (Jittor Var):
Var to be initialized by random kaiming_normal
a (float):
the negative slope of the rectifier used after this layer (only used with 'leaky_relu')
mode (string):
mode selection, should be fan_in or fan_out.
Choosing 'fan_in' preserves the magnitude of the variance of the weights in the forward pass. Choosing 'fan_out' preserves the magnitudes in the backwards pass.
nonlinearity (string):
nonlinearity used after this layer.
It can be one of [linear, conv*, sigmoid, tanh, relu, leaky_relu].
leaky_relu is used by default.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.kaiming_normal_(linear.weight)
print(linear.weight)
linear.weight.kaiming_normal_() # This is ok too
'''
std = calculate_std(var,mode,nonlinearity,a)
return gauss_(var,0, std)
Var.kaiming_normal_ = kaiming_normal_
def xavier_uniform(shape, dtype, gain=1.0):
def xavier_uniform(shape, dtype="float32", gain=1.0):
''' Return a Jittor Var initialized by xavier_uniform.
The resulting var will have values sampled from
:math:`uniform(-a, a)` where
.. math::
a = \text{gain} \times \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}
Args:
shape (int or tuple of int):
shape of the return Var.
dtype (string):
dtype of the return Var, default float32.
gain (float):
an optional scaling factor.
Example::
from jittor import init
from jittor import nn
a = init.xavier_uniform((2,2), gain=init.calculate_gain('relu'))
print(a)
'''
assert len(shape)>1
matsize=1
@ -111,9 +532,58 @@ def xavier_uniform(shape, dtype, gain=1.0):
return uniform(shape, dtype, -bound, bound)
def xavier_uniform_(var, gain=1.0):
return var.assign(xavier_uniform(tuple(var.shape), var.dtype, gain))
''' Inplace initialize Jittor Var by xavier_uniform.
The resulting var will have values sampled from
:math:`uniform(-a, a)` where
def xavier_gauss(shape, dtype, gain=1.0):
.. math::
a = \text{gain} \times \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}
Args:
var (Jittor Var):
Var to be initialized by random xavier_uniform
gain (float):
an optional scaling factor.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.xavier_uniform_(linear.weight, init.calculate_gain('relu'))
print(linear.weight)
linear.weight.xavier_uniform_() # This is ok too
'''
return var.assign(xavier_uniform(tuple(var.shape), var.dtype, gain))
Var.xavier_uniform_ = xavier_uniform_
def xavier_gauss(shape, dtype="float32", gain=1.0):
''' Return Jittor Var initialized by xavier_gauss, a.k.a xavier_normal.
The resulting var will have values sampled from
:math:`gauss(0, \text{std})` where
.. math::
\text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}}
Args:
shape (int or tuple of int):
shape of the return Var.
dtype (string):
dtype of the return Var, default float32.
gain (float):
an optional scaling factor.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.xavier_gauss_(linear.weight, init.calculate_gain('relu'))
print(linear.weight)
linear.weight.xavier_gauss_() # This is ok too
'''
assert len(shape)>1
matsize=1
@ -124,4 +594,74 @@ def xavier_gauss(shape, dtype, gain=1.0):
return gauss(shape, dtype, 0, std)
def xavier_gauss_(var, gain=1.0):
''' Inplace initialize Jittor Var by xavier_gauss, a.k.a xavier_normal.
The resulting var will have values sampled from
:math:`gauss(0, \text{std})` where
.. math::
\text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}}
Args:
var (Jittor Var):
Var to be initialized by random xavier_gauss
gain (float):
an optional scaling factor.
Example::
from jittor import init
from jittor import nn
linear = nn.Linear(2,2)
init.xavier_gauss_(linear.weight, init.calculate_gain('relu'))
print(linear.weight)
linear.weight.xavier_gauss_() # This is ok too
'''
return var.assign(xavier_gauss(tuple(var.shape), var.dtype, gain))
Var.xavier_gauss_ = xavier_gauss_
def calculate_gain(nonlinearity, param=None):
r"""Return the recommended gain value for the given nonlinearity function.
The values are as follows:
================= ====================================================
nonlinearity gain
================= ====================================================
Linear / Identity :math:`1`
Conv{1,2,3}D :math:`1`
Sigmoid :math:`1`
Tanh :math:`\frac{5}{3}`
ReLU :math:`\sqrt{2}`
Leaky Relu :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}`
SELU :math:`\frac{3}{4}`
================= ====================================================
Args:
nonlinearity: the non-linear function (`nn.functional` name)
param: optional parameter for the non-linear function
Examples:
>>> gain = nn.init.calculate_gain('leaky_relu', 0.2) # leaky_relu with negative_slope=0.2
.. _Self-Normalizing Neural Networks: https://papers.nips.cc/paper/2017/hash/5d44ee6f2c3f71b73125876103c8f6c4-Abstract.html
"""
linear_fns = ['linear', 'conv1d', 'conv2d', 'conv3d', 'conv_transpose1d', 'conv_transpose2d', 'conv_transpose3d']
if nonlinearity in linear_fns or nonlinearity == 'sigmoid':
return 1
elif nonlinearity == 'tanh':
return 5.0 / 3
elif nonlinearity == 'relu':
return math.sqrt(2.0)
elif nonlinearity == 'leaky_relu':
if param is None:
negative_slope = 0.01
elif not isinstance(param, bool) and isinstance(param, int) or isinstance(param, float):
# True/False are instances of int, hence check above
negative_slope = param
else:
raise ValueError("negative_slope {} not a valid number".format(param))
return math.sqrt(2.0 / (1 + negative_slope ** 2))
elif nonlinearity == 'selu':
return 3.0 / 4
else:
raise ValueError("Unsupported nonlinearity {}".format(nonlinearity))

View File

@ -937,7 +937,7 @@ Output::
print(out)
return tree, out
def python_pass_warper(mod_func, args, kw):
def python_pass_wrapper(mod_func, args, kw):
import importlib
mod, func = mod_func.rsplit(".", 1)
mod = importlib.import_module(mod)

View File

@ -7,12 +7,12 @@ import math
model_urls = {
'res2net50_14w_8s': 'https://cloud.tsinghua.edu.cn/f/2543e4b5646d40a1afa9/?dl=1&fname=/res2net50_14w_8s.pkl',
'res2net50_26w_4s': 'https://cloud.tsinghua.edu.cn/f/927fead9c9884f769d88/?dl=1&fname=/res2net50_26w_4s.pkl',
'res2net50_26w_6s': 'https://cloud.tsinghua.edu.cn/f/067875edf6ca488ba83e/?dl=1&fname=/res2net50_26w_6s.pkl',
'res2net50_26w_8s': 'https://cloud.tsinghua.edu.cn/f/ce1230155a2c4352bf17/?dl=1&fname=/res2net50_26w_8s.pkl',
'res2net50_48w_2s': 'https://cloud.tsinghua.edu.cn/f/b8a4df2b2cb64500b869/?dl=1&fname=/res2net50_48w_2s.pkl',
'res2net101_26w_4s': 'https://cloud.tsinghua.edu.cn/f/b85283bf572649d288bb/?dl=1&fname=/res2net101_26w_4s.pkl',
'res2net50_14w_8s': 'jittorhub://res2net50_14w_8s.pkl',
'res2net50_26w_4s': 'jittorhub://res2net50_26w_4s.pkl',
'res2net50_26w_6s': 'jittorhub://res2net50_26w_6s.pkl',
'res2net50_26w_8s': 'jittorhub://res2net50_26w_8s.pkl',
'res2net50_48w_2s': 'jittorhub://res2net50_48w_2s.pkl',
'res2net101_26w_4s': 'jittorhub://res2net101_26w_4s.pkl',
}

View File

@ -155,22 +155,151 @@ jt.Var.__imatmul__ = lambda a,b: a.assign(matmul(a,b))
def get_init_var_rand(shape, dtype):
return jt.array(np.random.normal(0.0, 1.0, shape).astype(np.float32))
def relu(x): return jt.ternary((x>0.0), x, jt.broadcast_var(0.0, x))
def leaky_relu(x, scale=0.01): return jt.ternary(x>0, x, x*scale)
def relu6(x): return jt.minimum(jt.maximum(x, 0.0), 6.0)
def elu(x,alpha=1.0):return jt.ternary(x>0,x,alpha*(x.exp()-1))
def sign(x):
def relu(x):
r''' Applies the element-wise function:
.. math::
\text{ReLU}(x) = \max(0,x)
:param x: the input var
:type x: jt.Var
Example:
>>> a = jt.randn(3)
>>> a
jt.Var([-0.38380373 1.1338731 6.128115 ], dtype=float32)
>>> nn.relu(a)
jt.Var([0. 1.1338731 6.128115 ], dtype=float32)
'''
return jt.ternary((x>0.0), x, jt.broadcast_var(0.0, x))
def leaky_relu(x, scale=0.01):
r''' Applies the element-wise function:
.. math::
\text{LeakyRELU}(x) =
\begin{cases}
x, & \text{ if } x \geq 0 \\
\text{scale} \times x, & \text{ otherwise }
\end{cases}
:param x: the input var
:type x: jt.Var
:param scale: the scale value for the leaky relu formulation. Default: 0.01
:type scale: float, optional
Example:
>>> a = jt.randn(3)
>>> a
jt.Var([-0.38380373 1.1338731 6.128115 ], dtype=float32)
>>> nn.leaky_relu(a)
jt.Var([-3.8380371e-03 1.1338731e+00 6.1281152e+00], dtype=float32)
'''
return jt.ternary(x>0, x, x*scale)
def relu6(x):
r''' Applies the element-wise function:
.. math::
\text{ReLU6}(x) = \min(\max(0,x), 6)
:param x: the input var
:type x: jt.Var
Example:
>>> a = jt.randn(3)
>>> a
jt.Var([-0.38380373 1.1338731 6.128115 ], dtype=float32)
>>> nn.relu6(a)
jt.Var([0. 1.1338731 6. ], dtype=float32)
'''
return jt.minimum(jt.maximum(x, 0.0), 6.0)
def elu(x: jt.Var, alpha: float = 1.0) -> jt.Var:
r''' Applies the element-wise function:
.. math::
\text{ELU}(x) = \begin{cases}
x, & \text{ if } x > 0\\
\alpha * (\exp(x) - 1), & \text{ if } x \leq 0
\end{cases}
:param x: the input var
:type x: jt.Var
:param alpha: the :math:`\alpha` value for the ELU formulation. Default: 1.0
:type alpha: float, optional
Example:
>>> a = jt.randn(3)
>>> a
jt.Var([-0.38380373 -1.1338731 2.128115 ], dtype=float32)
>>> nn.elu(a)
jt.Var([-0.31873488 -0.6782155 2.128115 ], dtype=float32)
'''
return jt.ternary(x>0,x,alpha*(x.exp()-1))
def sign(x: jt.Var) -> jt.Var:
''' Returns the signs of the elements of x
:param x: the input Var
:type x: jt.Var
Example:
>>> a = jt.float32([0.99, 0, -0.99])
>>> nn.sign(a)
jt.Var([ 1. 0. -1.], dtype=float32)
'''
one = jt.ones(x.shape)
x = jt.ternary(x>0, one, x)
return jt.ternary(x<0, -one, x)
def gelu(x):
r''' Applies the element-wise function:
.. math:: \text{GELU}(x) = x * \Phi(x)
where :math:`\Phi(x)` is the Cumulative Distribution Function for Gaussian Distribution.
:param x: the input var
:type x: jt.Var
Example:
>>> a = jt.randn(3)
>>> a
jt.Var([-0.38380373 -1.1338731 2.128115 ], dtype=float32)
>>> nn.gelu(a)
jt.Var([-0.134547 -0.14561 2.09265 ], dtype=float32)
'''
_sqrt2 = 1.4142135623730951
erf = jt.erf(x/_sqrt2)+1
r = erf*x*.5
return r
class ELU(Module):
r''' Applies the element-wise function:
.. math::
\text{ELU}(x) = \begin{cases}
x, & \text{ if } x > 0\\
\alpha * (\exp(x) - 1), & \text{ if } x \leq 0
\end{cases}
:param x: the input var
:type x: jt.Var
:param alpha: the :math:`\alpha` value for the ELU formulation. Default: 1.0
:type alpha: float, optional
Example:
>>> a = jt.randn(3)
>>> a
jt.Var([-0.38380373 -1.1338731 2.128115 ], dtype=float32)
>>> nn.elu(a)
jt.Var([-0.31873488 -0.6782155 2.128115 ], dtype=float32)
'''
def __init__(self,alpha=1.0):
self.alpha=alpha
@ -178,6 +307,31 @@ class ELU(Module):
return elu(x,self.alpha)
class PReLU(Module):
r''' Applies the element-wise function:
.. math::
\text{PReLU}(x) =
\begin{cases}
x, & \text{ if } x \geq 0 \\
ax, & \text{ otherwise }
\end{cases}
:param x: the input var
:type x: jt.Var
:param num_parameters: number of :math:`a` to learn, can be either 1 or the number of channels at input. Default: 1
:type num_parameters: int, optional
:param init: the initial value of :math:`a`. Default: 0.25
:type init: float, optional
Example:
>>> a = jt.randn(3)
>>> prelu = nn.PReLU()
>>> prelu(a)
jt.Var([-0.09595093 1.1338731 6.128115 ], dtype=float32)
'''
def __init__(self, num_parameters=1, init_=0.25):
self.num_parameters = num_parameters
self.weight = init.constant((num_parameters,), "float32", init_)
@ -600,6 +754,41 @@ GELU = jt.make_module(gelu)
from jittor.depthwise_conv import DepthwiseConv
class Conv(Module):
''' Applies a 2D convolution over an input signal composed of several input planes.
:param in_channels: Number of channels in the input feature map
:type in_channels: int
:param out_channels: Number of channels in the output feature map
:type out_channels: int
:param kernel_size: Size of the convolving kernel
:type kernel_size: int or tuple
:param stride: Stride of the convolution. Default: 1
:type stride: int or tuple, optional
:param padding: Padding added to all four sides of the input. Default: 0
:type padding: int or tuple, optional
:param dilation: Spacing between kernel elements. Default: 1
:type dilation: int or tuple, optional
:param groups: Number of blocked connections from input channels to output channels. Default: 1
:type groups: int, optional
:param bias: If True, adds a learnable bias to the output. Default: True
:type bias: bool, optional
Example:
>>> conv = nn.Conv2d(24, 32, 3)
>>> conv = nn.Conv2d(24, 32, (3,3))
>>> conv = nn.Conv2d(24, 32, 3, stride=2, padding=1)
>>> conv = nn.Conv2d(24, 32, 3, dilation=(3, 1))
>>> input = jt.randn(4, 24, 100, 100)
>>> output = conv(input)
'''
def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True):
self.in_channels = in_channels
self.out_channels = out_channels
@ -695,6 +884,41 @@ class Conv(Module):
Conv2d = Conv
class Conv1d(Module):
''' Applies a 1D convolution over an input signal composed of several input planes.
:param in_channels: Number of channels in the input feature map
:type in_channels: int
:param out_channels: Number of channels in the output feature map
:type out_channels: int
:param kernel_size: Size of the convolving kernel
:type kernel_size: int or tuple
:param stride: Stride of the convolution. Default: 1
:type stride: int or tuple, optional
:param padding: Padding added to both sides of the input. Default: 0
:type padding: int or tuple, optional
:param dilation: Spacing between kernel elements. Default: 1
:type dilation: int or tuple, optional
:param groups: Number of blocked connections from input channels to output channels. Default: 1
:type groups: int, optional
:param bias: If True, adds a learnable bias to the output. Default: True
:type bias: bool, optional
Example:
>>> conv = nn.Conv1d(24, 32, 3)
>>> conv = nn.Conv1d(24, 32, 5)
>>> conv = nn.Conv1d(24, 32, 3, stride=2, padding=1)
>>> conv = nn.Conv1d(24, 32, 3, dilation=3)
>>> input = jt.randn(4, 24, 100)
>>> output = conv(input)
'''
def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True):
self.in_channels = in_channels
self.out_channels = out_channels
@ -721,6 +945,41 @@ class Conv1d(Module):
return y
class Conv3d(Module):
''' Applies a 3D convolution over an input signal composed of several input planes.
:param in_channels: Number of channels in the input feature map
:type in_channels: int
:param out_channels: Number of channels in the output feature map
:type out_channels: int
:param kernel_size: Size of the convolving kernel
:type kernel_size: int or tuple
:param stride: Stride of the convolution. Default: 1
:type stride: int or tuple, optional
:param padding: Padding added to all six sides of the input. Default: 0
:type padding: int or tuple, optional
:param dilation: Spacing between kernel elements. Default: 1
:type dilation: int or tuple, optional
:param groups: Number of blocked connections from input channels to output channels. Default: 1
:type groups: int, optional
:param bias: If True, adds a learnable bias to the output. Default: True
:type bias: bool, optional
Example:
>>> conv = nn.Conv3d(24, 32, 3)
>>> conv = nn.Conv3d(24, 32, (3,3,3))
>>> conv = nn.Conv3d(24, 32, 3, stride=2, padding=1)
>>> conv = nn.Conv3d(24, 32, 3, dilation=(3, 1, 1))
>>> input = jt.randn(4, 24, 50, 50, 50)
>>> output = conv(input)
'''
def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True):
self.in_channels = in_channels
self.out_channels = out_channels
@ -750,6 +1009,35 @@ class Conv3d(Module):
return conv3d(x, self.weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
def conv2d(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
''' Applies a 2D convolution over an input signal composed of several input planes.
:param x: the input image
:type x: jt.Var
:param weight: the convolution kernel
:type weight: jt.Var
:param bias: the bias after convolution
:type bias: jt.Var, optional
:param stride: Stride of the convolution. Default: 1
:type stride: int or tuple, optional
:param padding: Padding added to all four sides of the input. Default: 0
:type padding: int or tuple, optional
:param dilation: Spacing between kernel elements. Default: 1
:type dilation: int or tuple, optional
:param groups: Number of blocked connections from input channels to output channels. Default: 1
:type groups: int, optional
Example:
>>> x = jt.randn(4, 24, 100, 100)
>>> w = jt.randn(32, 24, 3, 3)
>>> y = nn.conv2d(x, w)
'''
padding = _pair(padding)
stride = _pair(stride)
dilation = _pair(dilation)
@ -808,6 +1096,35 @@ def conv2d(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
return y
def conv3d(x, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
''' Applies a 3D convolution over an input signal composed of several input planes.
:param x: the input volume
:type x: jt.Var
:param weight: the convolution kernel
:type weight: jt.Var
:param bias: the bias after convolution
:type bias: jt.Var, optional
:param stride: Stride of the convolution. Default: 1
:type stride: int or tuple, optional
:param padding: Padding added to all six sides of the input. Default: 0
:type padding: int or tuple, optional
:param dilation: Spacing between kernel elements. Default: 1
:type dilation: int or tuple, optional
:param groups: Number of blocked connections from input channels to output channels. Default: 1
:type groups: int, optional
Example:
>>> x = jt.randn(4, 24, 50, 50, 50)
>>> w = jt.randn(32, 24, 3, 3, 3)
>>> y = nn.conv3d(x, w)
'''
padding = _triple(padding)
stride = _triple(stride)
dilation = _triple(dilation)
@ -1171,13 +1488,30 @@ class ReplicationPad2d(Module):
])
class Embedding(Module):
''' A simple lookup table that stores embeddings of a fixed dictionary and size.
:param num: size of the dictionary of embeddings
:type num: int
:param dim: the size of each embedding vector
:type dim: int
Example:
>>> embedding = nn.Embedding(10, 3)
>>> x = jt.int32([1, 2, 3, 3])
>>> embedding(x)
jt.Var([[ 1.1128596 0.19169547 0.706642]
[ 1.2047412 1.9668795 0.9932192]
[ 0.14941819 0.57047683 -1.3217674]
[ 0.14941819 0.57047683 -1.3217674]], dtype=float32)
'''
def __init__(self, num, dim):
self.num = num
self.dim = dim
self.weight = jt.init.gauss([num,dim],'float32').stop_grad()
def execute(self, x):
res = self.weight[x].reshape([x.shape[0],self.dim])
res = self.weight[x.flatten()].reshape(x.shape + [self.dim])
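# look up with the flattened indices, then restore the index shape plus the embedding dim, so x may have any shape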
return res
class PixelShuffle(Module):
@ -1764,30 +2098,30 @@ ModuleList = Sequential
class LSTMCell(jt.Module):
''' A long short-term memory (LSTM) cell.
:param input_size: The number of expected features in the input
:type input_size: int
:param hidden_size: The number of features in the hidden state
:type hidden_size: int
:param bias: If False, then the layer does not use bias weights b_ih and b_hh. Default: True.
:type bias: bool, optional
Example:
>>> rnn = nn.LSTMCell(10, 20) # (input_size, hidden_size)
>>> input = jt.randn(2, 3, 10) # (time_steps, batch, input_size)
>>> hx = jt.randn(3, 20) # (batch, hidden_size)
>>> cx = jt.randn(3, 20)
>>> output = []
>>> for i in range(input.shape[0]):
hx, cx = rnn(input[i], (hx, cx))
output.append(hx)
>>> output = jt.stack(output, dim=0)
'''
def __init__(self, input_size, hidden_size, bias=True):
''' A long short-term memory (LSTM) cell.
:param input_size: The number of expected features in the input
:type input_size: int
:param hidden_size: The number of features in the hidden state
:type hidden_size: int
:param bias: If False, then the layer does not use bias weights b_ih and b_hh. Default: True.
:type bias: bool, optional
Example:
>>> rnn = nn.LSTMCell(10, 20) # (input_size, hidden_size)
>>> input = jt.randn(2, 3, 10) # (time_steps, batch, input_size)
>>> hx = jt.randn(3, 20) # (batch, hidden_size)
>>> cx = jt.randn(3, 20)
>>> output = []
>>> for i in range(input.shape[0]):
hx, cx = rnn(input[i], (hx, cx))
output.append(hx)
>>> output = jt.stack(output, dim=0)
'''
super().__init__()
self.hidden_size = hidden_size
@ -1825,31 +2159,31 @@ class LSTMCell(jt.Module):
class RNNCell(jt.Module):
''' An Elman RNN cell with tanh or ReLU non-linearity.
:param input_size: The number of expected features in the input
:type input_size: int
:param hidden_size: The number of features in the hidden state
:type hidden_size: int
:param bias: If False, then the layer does not use bias weights b_ih and b_hh. Default: True.
:type bias: bool, optional
:param nonlinearity: The non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh'.
:type nonlinearity: str, optional
Example:
>>> rnn = nn.RNNCell(10, 20)
>>> input = jt.randn((6, 3, 10))
>>> hx = jt.randn((3, 20))
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
'''
def __init__(self, input_size, hidden_size, bias=True, nonlinearity = "tanh"):
''' An Elman RNN cell with tanh or ReLU non-linearity.
:param input_size: The number of expected features in the input
:type input_size: int
:param hidden_size: The number of features in the hidden state
:type hidden_size: int
:param bias: If False, then the layer does not use bias weights b_ih and b_hh. Default: True.
:type bias: bool, optional
:param nonlinearity: The non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh'.
:type nonlinearity: str, optional
Example:
>>> rnn = nn.RNNCell(10, 20)
>>> input = jt.randn((6, 3, 10))
>>> hx = jt.randn((3, 20))
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
'''
super().__init__()
self.hidden_size = hidden_size
@ -1884,28 +2218,28 @@ class RNNCell(jt.Module):
class GRUCell(jt.Module):
''' A gated recurrent unit (GRU) cell.
:param input_size: The number of expected features in the input
:type input_size: int
:param hidden_size: The number of features in the hidden state
:type hidden_size: int
:param bias: If False, then the layer does not use bias weights b_ih and b_hh. Default: True.
:type bias: bool, optional
Example:
>>> rnn = nn.GRUCell(10, 20)
>>> input = jt.randn((6, 3, 10))
>>> hx = jt.randn((3, 20))
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
'''
def __init__(self, input_size, hidden_size, bias=True):
''' A gated recurrent unit (GRU) cell.
:param input_size: The number of expected features in the input
:type input_size: int
:param hidden_size: The number of features in the hidden state
:type hidden_size: int
:param bias: If False, then the layer does not use bias weights b_ih and b_hh. Default: True.
:type bias: bool, optional
Example:
>>> rnn = nn.GRUCell(10, 20)
>>> input = jt.randn((6, 3, 10))
>>> hx = jt.randn((3, 20))
>>> output = []
>>> for i in range(6):
hx = rnn(input[i], hx)
output.append(hx)
'''
super().__init__()
self.hidden_size = hidden_size

View File

@ -122,11 +122,19 @@ class Optimizer(object):
# sync grads and model if in mpi
if jt.in_mpi:
dep = []
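# add_dep(v) chains v after the previously recorded op via a control-dependency edge,
# keeping the MPI collectives below in a fixed execution order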
def add_dep(v):
nonlocal dep
v._add_dependency(dep)
dep = [v]
for g in grads:
g.assign(g.mpi_all_reduce("mean"))
add_dep(g._input(0))
if self.n_step % self.param_sync_iter == 0:
for p in params:
p.assign(p.mpi_broadcast())
add_dep(p)
self.n_step += 1
# set up grads in param_groups

View File

@ -692,13 +692,20 @@ def compile_src(src, h, basename):
}} catch (const std::exception& e) {{
if (!PyErr_Occurred()) {{
std::stringstream ss;
ss {error_log_code};
PyErr_Format(PyExc_RuntimeError,
"%s\\n%s\\nFailed reason:%s",
ss.str().c_str(),
R""({decs})"",
e.what()
);
if (check_async_executor_error(e, ss)) {{
PyErr_Format(PyExc_RuntimeError,
"%s",
ss.str().c_str()
);
}} else {{
ss {error_log_code};
PyErr_Format(PyExc_RuntimeError,
"%s\\n%s\\nFailed reason:%s",
ss.str().c_str(),
R""({decs})"",
e.what()
);
}}
}}
}}
{func_return_failed};

View File

@ -29,6 +29,7 @@
#include "misc/nan_checker.h"
#include "memory_profiler.h"
#include "utils/seh.h"
#include "utils/cache_compile.h"
namespace jittor {
@ -101,6 +102,55 @@ void load_fused_op(FusedOp& fused_op, vector<int>& fuse_ops, vector<Op*>& ops, i
}
}
void check_op_async_error(Op* op, bool is_fused_op, const std::exception& e, jittor::Log& logf) {
vector<Stack> stack;
if (is_fused_op) {
FusedOp& fused_op = *((FusedOp*)op);
logf >> "[OP TYPE]:" << "fused_op:(";
for (auto& op : fused_op.ops)
logf << op->name_ex() >> ",";
logf >> ")\n";
logf >> "[Input]:";
for (auto& vi : fused_op.vars)
if (vi.type == 0) logf << vi.var->dtype() >> vi.var->shape >> vi.var->name >> ",";
logf << "\n[Output]:";
Var* ov = nullptr;
for (auto& vi : fused_op.vars)
if (vi.type == 2) {
logf << vi.var->dtype() >> vi.var->shape >> vi.var->name >> ",";
ov = vi.var;
}
if (ov)
stack = get_node_trace(ov);
} else {
logf >> "[OP TYPE]:" << op->name_ex();
logf << "\n[Input]:";
for (auto v : op->inputs())
logf << v->dtype() >> v->shape >> v->name >> ",";
logf << "\n[Output]:";
Var* ov = nullptr;
for (auto v : op->outputs()) {
logf << v->dtype() >> v->shape >> v->name >> ",";
ov = v;
}
if (ov)
stack = get_node_trace(ov);
}
logf << "\n[Async Backtrace]:";
if (stack.size()) {
logf << "---";
for (auto& s : stack) {
logf << "\n " << s.file_path >> ":" >> s.lineno;
if (s.module_type.size()) logf << '<' >> s.module_type >> '>';
if (s.module_name.size() && s.module_name.find(":") == string::npos)
logf << '[' >> s.module_name >> ']';
}
} else
logf << "not found, please set env JT_SYNC=1, trace_py_var=3";
logf << "\n[Reason]:" << e.what();
jittor::LogFatalVoidify() && logf;
}
void Executor::run_sync(vector<Var*> vars, bool device_sync) {
auto allocator = get_allocator();
auto temp_allocator = get_allocator(true);
@ -531,13 +581,12 @@ void Executor::run_sync(vector<Var*> vars, bool device_sync) {
// log jit_key and file location
op->do_prepare(jkl);
string jit_src_path = Op::get_filename_from_jit_key(jkl.to_cstring(), ".cc");
LOGe << "[Error] source file location:" << jit_src_path;
if (is_fused_op) {
LOGf << "Execute fused operator(" >> rid >> '/' >> queue.size() >> ")"
<< "failed:" << fused_op.ops << "\n\nReason: " >> e.what();
} else
LOGf << "Execute operator(" >> rid >> '/' >> queue.size() >> ")"
<< "failed:" << op << "\n\nReason: " >> e.what();
jittor::Log logf(__FILELINE__, 'f', 0);
logf << "\nExecute fused operator(" >> rid >> '/' >> queue.size() >> ")"
<< "failed.";
if (jit_compiler::file_exist(jit_src_path))
logf << "\n[JIT Source]:" << jit_src_path << "\n";
check_op_async_error(op, is_fused_op, e, logf);
}
}
LOGvv << "All" << op_num << "ops finished, return vars:" << vars;

View File

@ -107,7 +107,7 @@ void migrate_to_cpu(Var* var, Allocator* allocator) {
if (!use_cuda_managed_allocator) {
// must be a device allocator
Allocation a(allocator, var->size);
checkCudaErrors(cudaMemcpy(a.ptr, var->mem_ptr, var->size, cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(a.ptr, var->mem_ptr, var->size, cudaMemcpyDefault));
var->allocator->free(var->mem_ptr, var->size, var->allocation);
var->mem_ptr = a.ptr;
var->allocation = a.allocation;

View File

@ -128,6 +128,7 @@ void display_memory_info(const char* fileline, bool dump_var, bool red_color) {
log << "cpu&gpu:" << FloatOutput{(double)all_total, " KMG", 1024, "B"}
<< "gpu:" << FloatOutput{(double)gpu_total, " KMG", 1024, "B"}
<< "cpu:" << FloatOutput{(double)cpu_total, " KMG", 1024, "B"} >> '\n';
size_t cpu_free = 0;
#if defined(__linux__)
cpu_free = get_avphys_pages() * sysconf(_SC_PAGESIZE);
@ -185,6 +186,20 @@ void display_memory_info(const char* fileline, bool dump_var, bool red_color) {
}
}
log >> "===========================\n";
if (red_color) {
bool gpu_overflow = (double)gpu_total>(double)mem_info.total_cuda_ram*0.95;
bool cpu_overflow = (double)cpu_total>(double)mem_info.total_cpu_ram*0.95;
if(gpu_overflow || cpu_overflow) {
double used = gpu_overflow ? (double)gpu_total : (double)cpu_total;
double total = gpu_overflow ? (double)mem_info.total_cuda_ram : (double)mem_info.total_cpu_ram;
log.end();
LOGf << "\n*******************\n"
>> (gpu_overflow?"GPU":"CPU") << "memory overflow, please reduce your batch_size or data size!\nTotal:" << FloatOutput{(double)total, " KMG", 1024, "B"} << "Used:" << FloatOutput{(double)used, " KMG", 1024, "B"};
} else
return;
}
log.end();
}

View File

@ -0,0 +1,57 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include <cmath>
#include <limits>
#include "misc/cpu_math.h"
namespace jittor {
#define CENTRAL_RANGE 0.7
template <typename T>
static inline typename std::enable_if<std::is_floating_point<T>::value, T>::type
calc_erfinv(T y) {
/* Function to calculate inverse error function. Rational approximation
is used to generate an initial approximation, which is then improved to
full accuracy by two steps of Newton's method. Code is a direct
translation of the erfinv m file in matlab version 2.0.
Author: Gary L. Pavlis, Indiana University
Date: February 1996
*/
T x, z, num, dem; /*working variables */
/* coefficients in rational expansion */
T a[4]={ 0.886226899, -1.645349621, 0.914624893, -0.140543331};
T b[4]={-2.118377725, 1.442710462, -0.329097515, 0.012229801};
T c[4]={-1.970840454, -1.624906493, 3.429567803, 1.641345311};
T d[2]={ 3.543889200, 1.637067800};
T y_abs = std::abs(y);
if(y_abs > 1.0) return std::numeric_limits<T>::quiet_NaN();
if(y_abs == 1.0) return std::copysign(std::numeric_limits<T>::infinity(), y);
if(y_abs <= static_cast<T>(CENTRAL_RANGE)) {
z = y * y;
num = (((a[3]*z + a[2])*z + a[1])*z + a[0]);
dem = ((((b[3]*z + b[2])*z + b[1])*z +b[0]) * z + static_cast<T>(1.0));
x = y * num / dem;
}
else{
z = std::sqrt(-std::log((static_cast<T>(1.0)-y_abs)/static_cast<T>(2.0)));
num = ((c[3]*z + c[2])*z + c[1]) * z + c[0];
dem = (d[1]*z + d[0])*z + static_cast<T>(1.0);
x = std::copysign(num, y) / dem;
}
/* Two steps of Newton-Raphson correction */
x = x - (std::erf(x) - y) / ((static_cast<T>(2.0)/static_cast<T>(std::sqrt(M_PI)))*std::exp(-x*x));
x = x - (std::erf(x) - y) / ((static_cast<T>(2.0)/static_cast<T>(std::sqrt(M_PI)))*std::exp(-x*x));
return x;
}
float _erfinv(float y) { return calc_erfinv(y); };
double _erfinv(double y) { return calc_erfinv(y); };
}

View File

@ -0,0 +1,16 @@
// ***************************************************************
// Copyright (c) 2021 Jittor. All Rights Reserved.
// Maintainers: Dun Liang <randonlang@gmail.com>.
// This file is subject to the terms and conditions defined in
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#pragma once
#include "common.h"
namespace jittor {
float _erfinv(float y);
double _erfinv(double y);
}

View File

@ -22,7 +22,13 @@ void setter_use_cuda(int value) {
if (value) {
int count=0;
cudaGetDeviceCount(&count);
CHECK(count>0) << "No device found.";
if (count == 0) {
if (getenv("CUDA_VISIBLE_DEVICES")) {
LOGf << "No device found, please unset your "
"enviroment variable 'CUDA_VISIBLE_DEVICES'";
} else
LOGf << "No device found";
}
LOGi << "CUDA enabled.";
} else {
LOGv << "CUDA disabled.";

View File

@ -85,7 +85,8 @@ static unordered_set<string> unary_ops = {
"cosh",
"acosh",
"sigmoid",
"erf"
"erf",
"erfinv"
};
static unordered_set<string> unary_float_ops = {

View File

@ -80,6 +80,7 @@ constexpr int ns_max_len = 16;
m(cosh) \
m(acosh) \
m(erf) \
m(erfinv) \
m(sigmoid) \
\
m(uniform) \

View File

@ -155,6 +155,7 @@ struct NanoVector {
return nv;
}
// @pyjt(__init__)
inline NanoVector(int64 x) { push_back(x); }
// @pyjt(__repr__)

View File

@ -856,14 +856,19 @@ string OpCompiler::__get_fused_src(
string arg_name = op_name + "_output";
string argp_name = op_name + "_outputp";
string T = ((ArrayOp*)ops[oi])->output->dtype().to_cstring();
fused_kernel_args += " ArrayOp* " + op_name + " = (ArrayOp*)(ops[" + S(oi) + "]);\n";
// op_name = "((ArrayOp*)(ops[" + S(oi) + "]))";
fused_kernel_args += " Var* " + arg_name + " = " + op_name + "->output;\n";
fused_kernel += " auto* " + argp_name + " = " + arg_name + "->ptr<" + T + ">();\n";
fused_kernel += " " + argp_name + "[0] = " + op_name + "->ptr<" + T + ">()[0];\n";
fused_kernel += " int " + arg_name + "shape0 = 1;\n";
fused_kernel += " int " + arg_name + "stride0 = 1;\n";
fused_kernel_args += precompile({{"oi",S(oi)}, {"T", T}}, R"(
Var* op@oi@@_output = ((ArrayOp*)(ops[@oi]))->output;
@T op@oi@@_outputv = ((ArrayOp*)(ops[@oi]))->ptr<@T>()[0];
)");
fused_kernel += precompile({{"oi",S(oi)}, {"T", T}}, R"(
@T* op@oi@@_outputp = op@oi@@_output->ptr<@T>();
op@oi@@_outputp[0] = op@oi@@_outputv;
)");
fused_includes += "#include \"ops/array_op.h\"\n";
op_members[oi].push_back(arg_name);

View File

@ -323,7 +323,7 @@ void GetitemOp::_compile_optimize(string& src) {
}
if (!has_zero) {
func->push_back("int no = o_shape.size();");
func->push_back("int masks[no];");
func->push_back("STACK_ALLOC(int,masks,no);");
func->push_back("int tdims[6];");
func->push_back("cuda_loop_schedule(o_shape, masks, tdims);");
func->push_back("dim3 grid_dim(tdims[3],tdims[4],tdims[5]);");

View File

@ -34,7 +34,7 @@ unordered_set<string> reduce_ops = {
* [in] x: the input jt.Var.
* [in] dim: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] dim or dims: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] keepdims: bool (optional). Whether the output has ``dim`` retained or not. Defaults to be False.
@ -65,7 +65,7 @@ unordered_set<string> reduce_ops = {
* [in] x: the input jt.Var.
* [in] dim: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] dim or dims: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] keepdims: bool (optional). Whether the output has ``dim`` retained or not. Defaults to be False.
@ -96,7 +96,7 @@ unordered_set<string> reduce_ops = {
* [in] x: the input jt.Var.
* [in] dim: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] dim or dims: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] keepdims: bool (optional). Whether the output has ``dim`` retained or not. Defaults to be False.
@ -127,7 +127,7 @@ unordered_set<string> reduce_ops = {
* [in] x: the input jt.Var.
* [in] dim: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] dim or dims: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] keepdims: bool (optional). Whether the output has ``dim`` retained or not. Defaults to be False.
@ -158,7 +158,7 @@ unordered_set<string> reduce_ops = {
* [in] x: the input jt.Var.
* [in] dim: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] dim or dims: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] keepdims: bool (optional). Whether the output has ``dim`` retained or not. Defaults to be False.
@ -189,7 +189,7 @@ unordered_set<string> reduce_ops = {
* [in] x: the input jt.Var.
* [in] dim: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] dim or dims: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] keepdims: bool (optional). Whether the output has ``dim`` retained or not. Defaults to be False.
@ -224,7 +224,7 @@ unordered_set<string> reduce_ops = {
* [in] x: the input jt.Var.
* [in] dim: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] dim or dims: int or tuples of ints (optional). If specified, reduce along the given the dimension(s).
* [in] keepdims: bool (optional). Whether the output has ``dim`` retained or not. Defaults to be False.

View File

@ -5,6 +5,7 @@
// file 'LICENSE.txt', which is part of this source code package.
// ***************************************************************
#include <cmath>
#include "misc/cpu_math.h"
#include "var.h"
#include "ops/unary_op.h"
#include "ops/unary_op_defs.h"
@ -523,6 +524,7 @@ static unordered_set<string> unary_ops = {
jt.Var([ 0.51559156 0.45739546 -0.85728306 -0.9258883 ], dtype=float32)
*/
"erf",
"erfinv",
};
UnaryOp::UnaryOp(Var* x, NanoString op) : x(x) {
@ -659,6 +661,14 @@ VarPtr UnaryOp::grad(Var* out, Var* dout, Var* v, int v_index) {
r = make_binary(r, two_div_sqrt_pi, ns_multiply);
return make_binary(dout, r, ns_multiply);
}
// derfinv(x) = sqrt(pi) / 2 * exp(erfinv(x)^2)
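// derivation: erf(erfinv(x)) = x  =>  erf'(erfinv(x)) * erfinv'(x) = 1, and
// erf'(t) = 2/sqrt(pi) * exp(-t^2), hence erfinv'(x) = sqrt(pi)/2 * exp(erfinv(x)^2);
// y below is the forward output erfinv(x)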
if (ns == ns_erfinv) {
auto sqrt_pi_div_two = make_number(1.7724538509055159/2, x);
auto y2 = make_binary(y, y, ns_multiply);
auto r = make_unary(y2, ns_exp);
r = make_binary(r, sqrt_pi_div_two, ns_multiply);
return make_binary(dout, r, ns_multiply);
}
return nullptr;
}

View File

@ -43,6 +43,7 @@ namespace jittor {
#define sigmoid(T,x) ((T) (1.0f/(1.0f+::expf((::min(T(-(x)), T(@if(@strcmp(@T,float32)==0,30,300))))))))
#define erf(T,x) ((T) ::erff((x)))
#define erfinv(T,x) ((T) ::erfinvf((T)(x)))
#else
#define abs(T,x) std::abs(x)
@ -74,6 +75,7 @@ namespace jittor {
#define sigmoid(T,x) ((T) (1.0f/(1.0f+std::exp(std::min(T(-(x)), T(@if(@strcmp(@T,float32)==0,30,300)))))))
#define erf(T,x) ((T) std::erf((x)))
#define erfinv(T,x) (jittor::_erfinv(x))
#endif

View File

@ -56,6 +56,11 @@ void LoopToFuncPass::run() {
if (d->has_attr("rvalue")) {
auto& rvalue = d->attrs["rvalue"];
auto& dtype = d->attrs["dtype"];
if (endswith(d->attrs["lvalue"], "_value") ||
endswith(d->attrs["lvalue"], "_outputv")) {
args.push_back(d.get());
continue;
}
if (rvalue.find("ops") != string::npos)
continue;
if (dtype=="Var*")
@ -67,10 +72,6 @@ void LoopToFuncPass::run() {
args.push_back(d.get());
continue;
}
if (endswith(d->attrs["lvalue"], "_value")) {
args.push_back(d.get());
continue;
}
}
}
func->push_back(d->clone());

View File

@ -122,6 +122,9 @@ void LoopVarAnalyzePass::run() {
&& (op->outputs().front()->shape.size() != max_elm_dim ||
std::abs(op->outputs().front()->num) != max_elm_size))
continue;
if (op->name_ex() == "array")
// array op should not be loop var
continue;
Var* loop_var;
if (op->type() == OpType::broadcast || op->name_ex() == "index") {
loop_var = op->output(0);

View File

@ -69,7 +69,7 @@ void PassManager::run_passes() {
if (oc->op->flags.get(NodeFlags::_cuda)) {
ir.children.back()->erase();
string type = oc->op->ops[0]->outputs().front()->dtype().to_cstring();
ir.push_back("kernel<<<1,1>>>(op0_outputp, op0->ptr<"+type+">()[0]);");
ir.push_back("kernel<<<1,1>>>(op0_outputp, op0_outputv);");
auto jt_type = type == "bool" ? type : "jittor::" + type;
ir.push_back("__global__ static void kernel("+jt_type+"* xp, "+jt_type+" x) { xp[0] = x; } ", &ir.before, true);
}

View File

@ -14,7 +14,7 @@ namespace jittor {
string py_caller(const string& mod_func, const vector<string>& args, const map<string,string>& kw) {
PyObjHolder mod(PyImport_ImportModule("jittor"));
PyObjHolder func(PyObject_GetAttrString(mod.obj, "python_pass_warper"));
PyObjHolder func(PyObject_GetAttrString(mod.obj, "python_pass_wrapper"));
PyObjHolder py_name(to_py_object<string>(mod_func));
PyObjHolder py_args(to_py_tuple(args));
PyObjHolder py_kw(to_py_object(kw));

View File

@ -786,5 +786,6 @@ DEF_IS(VarSlices, T) from_py_object(PyObject* obj, vector<unique_ptr<VarHolder>>
}
}
EXTERN_LIB bool check_async_executor_error(const std::exception& e, std::ostream& os);
} // jittor

View File

@ -23,6 +23,34 @@
namespace jittor {
bool check_async_executor_error(const std::exception& e, std::ostream& os) {
if (!e.what()) return false;
auto s = string(e.what());
if (s.find("executor.cc:") == string::npos)
return false;
os << s;
if (getenv("JT_SYNC") && getenv("trace_py_var"))
return true;
if (s.find("[Async Backtrace]: ---") != string::npos)
return true;
os << "\n**********\nAsync error was detected. "
"To locate the async backtrace and get better error report, please rerun your code with "
"two enviroment variables set:\n"
#ifdef _WIN32
"cmd: \n"
">>> set JT_SYNC=1\n"
">>> set trace_py_var=3\n"
"powershell: \n"
">>> $env:JT_SYNC=1\n"
">>> $env:trace_py_var=3\n"
#else
">>> export JT_SYNC=1\n"
">>> export trace_py_var=3\n"
#endif
;
return true;
}
SEH_HOOK;
void init_subprocess() {

View File

@ -9,6 +9,7 @@
#include <iomanip>
#include <thread>
#include <unordered_map>
#include <fstream>
#include "utils/cross_platform.h"
#include "utils/log.h"
#include "utils/mwsr_list.h"
@ -368,6 +369,17 @@ void setter_log_vprefix(string value) {
}
vprefix_map = move(new_map);
}
DEFINE_FLAG_WITH_SETTER(string, log_file, "",
"log to file, mpi env will add $OMPI_COMM_WORLD_RANK suffix\n");
void setter_log_file(string value) {
if (value.size() == 0)
return;
auto c = getenv("OMPI_COMM_WORLD_RANK");
if (c) value += string("_") + c;
static std::ofstream out;
out = std::ofstream(value);
std::cerr.rdbuf(out.rdbuf());
}
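// usage sketch (assumed, not part of this change): like other jittor flags, this can be set from
// Python, e.g. jt.flags.log_file = "/tmp/jt.log"; when launched under OpenMPI, rank k then
// writes to /tmp/jt.log_k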
bool check_vlog(const char* fileline, int verbose) {
uint64_t phash=0;

View File

@ -282,6 +282,33 @@ struct VarHolder {
*/
// @pyjt(__get__grad)
int grad();
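/* Return the i-th input var of the op that produced this var */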
// @pyjt(_input)
inline VarHolder* _input(int i) {
CHECK(!var->is_finished());
return new VarHolder(var->input()->input(i));
}
/* Add dependency, make var computed after vars
*/
// @pyjt(_add_dependency)
// @attrs(return_self)
inline VarHolder* _add_dependency(vector<VarHolder*>&& vars) {
vector<Node*> b(vars.size());
for (int i=0; i<vars.size(); i++)
b[i] = vars[i]->var;
CHECK(!var->is_finished());
auto a = var->input();
var->input()->add_inputs(b);
auto edge = a->_inputs.end();
for (int i=0; i<b.size(); i++) {
edge = std::prev(edge);
// set -1 mean this is a control dependency edge
edge->back->index = -1;
}
return this;
}
};
// @pyjt(sync)

View File

@ -193,6 +193,23 @@ class TestArray(unittest.TestCase):
assert str(c.dtype) == t
np.testing.assert_allclose(a, c)
def test_scalar_fuse_unary(self):
with jt.profile_scope() as rep:
a = jt.array([1])
b = -a
a = a.clone()
b = b.clone()
jt.sync([a, b])
assert a.data == 1
assert b.data == -1
assert len(rep) == 2
@unittest.skipIf(not jt.has_cuda, "Cuda not found")
def test_scalar_fuse_unary_cuda(self):
with jt.flag_scope(use_cuda=1):
self.test_scalar_fuse_unary()
if __name__ == "__main__":
unittest.main()

View File

@ -157,6 +157,12 @@ class TestBinaryOp(unittest.TestCase):
c = a % b
nc = a.data % b.data
np.testing.assert_allclose(c.data, nc.data, atol=1e-5, rtol=1e-5)
def test_pow(self):
# win cuda 10.2 cannot pass
a = jt.random((100,))
b = a**3
b.sync()

View File

@ -0,0 +1,71 @@
# ***************************************************************
# Copyright (c) 2021 Jittor. All Rights Reserved.
# Maintainers:
# Dun Liang <randonlang@gmail.com>.
#
# This file is subject to the terms and conditions defined in
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
import unittest
import jittor as jt
import numpy as np
class TestErrorMsg(unittest.TestCase):
def test_error_msg(self):
a = jt.array([3,2,1])
b = jt.code(a.shape, a.dtype, [a],
cpu_header="""
#include <algorithm>
@alias(a, in0)
@alias(b, out)
""",
cpu_src="""
for (int i=0; i<a_shape0; i++)
@b(i) = @a(i);
std::sort(&@b(0), &@b(in0_shape0));
throw std::runtime_error("???");
"""
)
msg = ""
try:
print(b)
except Exception as e:
msg = str(e)
assert "[Reason]: ???" in msg
assert "[Input]: int32[3,]" in msg
assert "[OP TYPE]: code" in msg
assert "[Async Backtrace]:" in msg
@jt.flag_scope(trace_py_var=3)
def test_error_msg_trace_py_var(self):
a = jt.array([3,2,1])
b = jt.code(a.shape, a.dtype, [a],
cpu_header="""
#include <algorithm>
@alias(a, in0)
@alias(b, out)
""",
cpu_src="""
for (int i=0; i<a_shape0; i++)
@b(i) = @a(i);
std::sort(&@b(0), &@b(in0_shape0));
throw std::runtime_error("???");
"""
)
msg = ""
try:
print(b)
except Exception as e:
msg = str(e)
print(msg)
assert "[Reason]: ???" in msg
assert "[Input]: int32[3,]" in msg
assert "[OP TYPE]: code" in msg
assert "[Async Backtrace]:" in msg
assert "test_error_msg.py:" in msg
if __name__ == "__main__":
unittest.main()

View File

@ -55,5 +55,42 @@ class TestInit(unittest.TestCase):
def test_resnet(self):
check(models.resnet152(), torchvision.models.resnet152(), rtol=5e-2, mean_atol=1e-2)
from jittor import init
from jittor import nn
class TestInitFunc(unittest.TestCase):
def test_eye(self):
a = init.eye(2, "float32")
np.testing.assert_allclose(a.data, [[1,0],[0,1]])
a = init.eye((2,3), "float32")
np.testing.assert_allclose(a.data, [[1,0,0],[0,1,0]])
linear = nn.Linear(2,2)
init.eye_(linear.weight)
np.testing.assert_allclose(linear.weight.data, [[1,0],[0,1]])
def test_constant(self):
a = init.constant(2, "float32")
np.testing.assert_allclose(a.data, [0,0])
a = init.constant((2,3), value=1.)
np.testing.assert_allclose(a.data, [[1,1,1],[1,1,1]])
linear = nn.Linear(2,2)
init.constant_(linear.weight)
np.testing.assert_allclose(linear.weight.data, [[0,0],[0,0]])
def test_uniform(self):
a = init.uniform(5, "float32")
assert ((a>0) & (a<1)).all()
a = init.uniform((2,3), low=-1, high=1)
assert ((a>-1) & (a<1)).all()
linear = nn.Linear(2,2)
init.uniform_(linear.weight)
assert (linear.weight > 0).all()
linear.weight.uniform_()
assert (linear.weight > 0).all()
if __name__ == "__main__":
unittest.main()

View File

@ -72,31 +72,31 @@ class TestResnet(unittest.TestCase):
epoch_id = self.train_loader.epoch_id
# train step
with jt.log_capture_scope(
log_silent=1,
log_v=1, log_vprefix="op.cc=100,exe=10",
) as logs:
output = mnist_net(data)
loss = nn.cross_entropy_loss(output, target)
SGD.step(loss)
def callback(epoch_id, batch_id, loss, output, target):
# print train info
global prev
pred = np.argmax(output, axis=1)
acc = np.mean(target==pred)
loss_list.append(loss[0])
acc_list.append(acc)
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tAcc: {:.6f} \tTime:{:.3f}'
.format(epoch_id, batch_id, 600,1. * batch_id / 6.0, loss[0], acc, time.time()-prev))
# prev = time.time()
jt.fetch(epoch_id, batch_id, loss, output, target, callback)
# with jt.log_capture_scope(
# log_silent=1,
# log_v=1, log_vprefix="op.cc=100,exe=10",
# ) as logs:
output = mnist_net(data)
loss = nn.cross_entropy_loss(output, target)
SGD.step(loss)
def callback(epoch_id, batch_id, loss, output, target):
# print train info
global prev
pred = np.argmax(output, axis=1)
acc = np.mean(target==pred)
loss_list.append(loss[0])
acc_list.append(acc)
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\tAcc: {:.6f} \tTime:{:.3f}'
.format(epoch_id, batch_id, 600,1. * batch_id / 6.0, loss[0], acc, time.time()-prev))
# prev = time.time()
jt.fetch(epoch_id, batch_id, loss, output, target, callback)
log_conv = find_log_with_re(logs,
"Jit op key (not )?found: ((mkl)|(cudnn))_conv.*")
log_matmul = find_log_with_re(logs,
"Jit op key (not )?found: ((mkl)|(cublas))_matmul.*")
if batch_id > 2:
assert len(log_conv)==59 and len(log_matmul)==6, (len(log_conv), len(log_matmul))
# log_conv = find_log_with_re(logs,
# "Jit op key (not )?found: ((mkl)|(cudnn))_conv.*")
# log_matmul = find_log_with_re(logs,
# "Jit op key (not )?found: ((mkl)|(cublas))_matmul.*")
# if batch_id > 2:
# assert len(log_conv)==59 and len(log_matmul)==6, (len(log_conv), len(log_matmul))
mem_used = jt.flags.stat_allocator_total_alloc_byte \
-jt.flags.stat_allocator_total_free_byte

View File

@ -68,6 +68,10 @@ class TestTransposeOp(unittest.TestCase):
assert a.permute().shape == [4,3,2]
assert a.permute(0,2,1).shape == [2,4,3]
def test_transpose_3d2i(self):
a = jt.ones([2,3,4])
assert a.transpose(0,1).shape == (3,2,4)
@unittest.skipIf(not jt.compiler.has_cuda, "No CUDA found")
@jt.flag_scope(use_cuda=1)
def test_cutt(self):

View File

@ -76,6 +76,25 @@ class TestUnaryOp(unittest.TestCase):
da = jt.grad(b, a)
assert (da.data == 1).all()
def test_erfinv(self):
from scipy import special
y = np.linspace(-1.0, 1.0, num=10)
x = special.erfinv(y)
y2 = jt.array(y)
x2 = jt.erfinv(y2)
np.testing.assert_allclose(x, x2.data)
y = np.linspace(-0.9, 0.9, num=10)
x = special.erfinv(y)
y2 = jt.array(y)
x2 = jt.erfinv(y2)
np.testing.assert_allclose(x, x2.data)
d = jt.grad(x2, y2)
_, (dn,) = ngrad(lambda y: special.erfinv(y).sum(), [y], 1e-8)
np.testing.assert_allclose(d.data, dn, atol=1e-6, rtol=1e-6)
class TestUnaryOpCuda(TestUnaryOp, test_cuda(2)):
pass

View File

@ -0,0 +1,250 @@
# ***************************************************************
# Copyright (c) 2021 Jittor. All Rights Reserved.
# Maintainers:
# Zheng-Ning Liu <lzhengning@com>
#
# This file is subject to the terms and conditions defined in
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
""" This file implements generation of stub files for Jittor C extensions.
In detail, autocompletion for the following is supported.
- functions in __init__.py
- functions in jittor.core.ops
- attributes of jittor.flags
- methods of jittor.Var
Prerequisite:
- mypy for automatic stub generation
Usage: python3 -m jittor.utils.gen_pyi
"""
import os
import re
import shutil
import jittor
def add_indent(s: str, n=1):
for _ in range(n):
s = '\t' + s.replace('\n', '\n\t', s.count('\n')-1)
return s
def ctype_to_python(type_str):
if type_str == "bool":
return "bool"
if type_str in ["int", "uint", "int64", "uint64", "size_t"]:
return "int"
if type_str in ["float32", "float64"]:
return "float"
if type_str in ["string", "string&&", "NanoString", "char*", "const char*"]:
return "str"
if type_str in ["vector<int>"]:
return "List[int]"
if type_str in ["vector<string>&&", "vector<NanoString>&&"]:
return "List[str]"
if type_str == "VarHolder*":
return "Var"
if type_str in ["vector<VarHolder*>", "vector<VarHolder*>&&"]:
return "List[Var]"
if type_str == "NanoVector":
return "Tuple[int]"
if type_str == "vector<NanoVector>&&":
return "List[Tuple[int]]"
if type_str in ["FetchFunc", "FetchFunc&&", "NumpyFunc&&"]:
return "Callable"
if type_str == "vector<NumpyFunc>&&":
return "List[Callable]"
if type_str == "PyObject*":
return "float | int | numpy.ndarray | Var"
if type_str == "VarSlices&&":
return "slice"
if type_str in ["ArrayArgs", "ArrayArgs&&", "DataView"]:
return "numpy.ndarray"
if type_str == 'ItemData':
return "float | int | bool"
if type_str == "void":
return ""
print(f"[warning] Unknown ctype: {type_str}, do not write type hinting")
return ""
def cval_to_python(val_str: str):
if val_str == "false":
return "False"
if val_str == "true":
return "True"
if val_str.startswith("ns_"):
return f'"{val_str[3:]}"'
if val_str == "NanoVector()":
return "()"
return val_str
def run_stubgen(jittor_path, cache_path):
# for __init__.py functions
stubpath = os.path.join(cache_path, 'stubs')
stubfile = os.path.join(stubpath, "jittor", "__init__.pyi")
os.system(f"stubgen -m jittor -o {stubpath} -q")
with open(stubfile) as f:
mypy_content = f.read()
f = open(stubfile, "w")
# Remove the following type redirections
unused_content = ["ori_int = int\n",
"ori_float = float\n",
"ori_bool = bool\n",
"int = int32\n",
"float = float32\n",
"double = float64\n",
"\nflags: Any\n"]
for unused in unused_content:
mypy_content = mypy_content.replace(unused, "")
f.write(mypy_content)
shutil.move(stubfile, os.path.join(jittor_path, "__init__.pyi"))
shutil.rmtree(stubpath)
shutil.rmtree(os.path.expanduser(".mypy_cache"))
def gen_ops_stub(jittor_path):
f = open(os.path.join(jittor_path, "__init__.pyi"), "a")
f.write("from typing import List, Tuple, Callable, overload\n")
f.write("import numpy\n")
var_hint = "class Var:\n\t'''Variable that stores multi-dimensional data.'''\n"
var_methods = set()
def decl_to_param_hints(decl):
param_decl = re.findall(r".+ [a-zA-Z_0-9]+\((.*)\)", decl)[0]
if not param_decl.strip():
return []
param_hints = []
for param_str in param_decl.split(','):
if "=" in param_str:
template = r"\s*(.+)\s+([a-zA-Z_0-9]+)\s*=\s*(.+)"
param_type, param_name, param_val = re.findall(template, param_str)[0]
param_type = ctype_to_python(param_type)
param_val = cval_to_python(param_val)
else:
param_type, param_name = param_str.strip().rsplit(' ', maxsplit=1)
param_type = ctype_to_python(param_type)
param_val = ""
hint = param_name
if param_type:
hint += ": " + param_type
if param_val:
hint += "=" + param_val
param_hints.append(hint)
return param_hints
def generate_var_hint(decorators, return_type, param_hints, docstring):
hint = add_indent(decorators) if decorators else ""
hint += f"\tdef {func_name}("
hint += ", ".join(['self'] + param_hints) + ")"
hint += f"-> {return_type}" if return_type else ""
hint += ":"
if docstring:
hint += add_indent(f"\n'''{docstring}'''\n", 2) + "\t\t...\n"
else:
hint += f" ...\n"
return hint
for func_name, func in jittor.ops.__dict__.items():
if func_name.startswith("__"):
continue
# Exclude a function that overrides the builtin bool:
# def bool(x: Var) -> Var: ...
# It will confuse the IDE. So we ignore this function in pyi.
if func_name == "bool":
continue
docstring = func.__doc__[:func.__doc__.find("Declaration:")]
docstring = docstring.replace("'''", '"""').strip()
declarations = re.findall(r"Declaration:\n(.+)\n", func.__doc__)
for decl in declarations:
decorators = "@overload\n" if len(declarations) > 1 else ""
return_type = ctype_to_python(decl.split(' ', maxsplit=1)[0])
param_hints = decl_to_param_hints(decl)
func_text = decorators
func_text += f"def {func_name}"
func_text += "(" + ", ".join(param_hints) + ")"
func_text += f"-> {return_type}" if return_type else ""
func_text += ":\n"
if docstring:
func_text += add_indent(f"'''{docstring}'''\n") + "\t...\n"
else:
func_text += f" ...\n"
f.write(func_text)
if not "Var" in param_hints[0]:
continue
var_methods.add(func_name)
var_hint += generate_var_hint(decorators, return_type, param_hints[1:], docstring)
for func_name, func in jittor.Var.__dict__.items():
if func_name.startswith("__") or func_name in var_methods:
continue
if func_name in ["int", "float", "double", "bool", "long"]:
continue
if func.__doc__ is None:
continue
docstring = func.__doc__[:func.__doc__.find("Declaration:")]
docstring = docstring.replace("'''", '"""').strip()
declarations = re.findall(r"Declaration:\n(.+)\n", func.__doc__)
for decl in declarations:
decl = decl.replace("inline ", "")
decorators = "@overload\n" if len(declarations) > 1 else ""
return_type = re.findall(r"(.+) [a-zA-Z_0-9]+\(.*\)", decl)[0].split()[-1]
return_type = ctype_to_python(return_type)
param_hints = decl_to_param_hints(decl)
var_hint += generate_var_hint(decorators, return_type, param_hints, docstring)
f.write(var_hint)
f.close()
def gen_flags_stub(jittor_path):
f = open(os.path.join(jittor_path, "__init__.pyi"), "a")
f.write("class Flags:\n")
f.write("\t'''A set of flags to configure jittor running behaviors'''\n")
for attr_name, attr in jittor.Flags.__dict__.items():
if attr_name.startswith("__"):
continue
docstring = attr.__doc__
docstring = attr.__doc__[:attr.__doc__.find("Declaration:")]
docbody = re.findall("\(type.+default.+\):(.+)", docstring)[0].strip()
docbody += "." if not docbody.endswith('.') else ""
attr_type, attr_val = re.findall(r"\(type:(.+), default:(.+)\)", docstring)[0]
attr_type = ctype_to_python(attr_type)
attr_type = attr_type if attr_type else "Any"
f.write(f"\t{attr_name}: {attr_type}\n")
f.write(f"\t'''{docbody} Default: {attr_val}'''\n")
f.write("flags: Flags\n")
f.write("'''Jittor running time flags instance'''\n")
f.close()
def get_pyi(jittor_path=None, cache_path=None):
if jittor_path is None:
jittor_path = jittor.flags.jittor_path
if cache_path is None:
import jittor_utils
cache_path = jittor_utils.cache_path
run_stubgen(jittor_path, cache_path)
gen_ops_stub(jittor_path)
gen_flags_stub(jittor_path)
print(f"Generated stubfile: {os.path.join(jittor_path, '__init__.pyi')}")
if __name__ == "__main__":
get_pyi()

View File

@ -23,7 +23,7 @@ import urllib.request
if platform.system() == 'Darwin':
mp.set_start_method('fork')
class LogWarper:
class Logwrapper:
def __init__(self):
self.log_silent = int(os.environ.get("log_silent", "0"))
self.log_v = int(os.environ.get("log_v", "0"))
@ -237,12 +237,76 @@ def download(url, filename):
urllib.request.urlretrieve(url, filename)
LOG.v("Download finished")
def get_jittor_version():
path = os.path.dirname(__file__)
with open(os.path.join(path, "../jittor/__init__.py"), "r", encoding='utf8') as fh:
for line in fh:
if line.startswith('__version__'):
version = line.split("'")[1]
break
else:
raise RuntimeError("Unable to find version string.")
return version
def get_str_hash(s):
import hashlib
md5 = hashlib.md5()
md5.update(s.encode())
return md5.hexdigest()
def get_cpu_version():
v = platform.processor()
try:
if os.name == 'nt':
import winreg
key_name = r"Hardware\Description\System\CentralProcessor\0"
field_name = "ProcessorNameString"
key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_name)
value = winreg.QueryValueEx(key, field_name)[0]
winreg.CloseKey(key)
v = value
elif platform.system() == "Darwin":
r, s = sp.getstatusoutput("sysctl -a sysctl machdep.cpu.brand_string")
if r==0:
v = s.split(":")[-1].strip()
else:
with open("/proc/cpuinfo", 'r') as f:
for l in f:
if l.startswith("model name"):
v = l.split(':')[-1].strip()
break
except:
pass
return v
def short(s):
ss = ""
for c in s:
if str.isidentifier(c) or str.isnumeric(c) \
or str.isalpha(c) or c in '.-+':
ss += c
if len(ss)>14:
return ss[:14]+'x'+get_str_hash(ss)[:2]
return ss
def find_cache_path():
from pathlib import Path
path = str(Path.home())
dirs = [".cache", "jittor", os.path.basename(cc_path)]
if os.environ.get("debug")=="1":
dirs[-1] += "_debug"
# jittor version key
jtv = "jt"+get_jittor_version().rsplit('.', 1)[0]
# cc version key
ccv = cc_type+get_version(cc_path)[1:-1] \
if cc_type != "cl" else cc_type
# os version key
osv = platform.platform() + platform.node()
if len(osv)>14:
osv = osv[:14] + 'x'+get_str_hash(osv)[:2]
# py version
pyv = "py"+platform.python_version()
# cpu version
cpuv = get_cpu_version()
dirs = [".cache", "jittor", jtv, ccv, pyv, osv, cpuv]
dirs = list(map(short, dirs))
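# e.g. (illustrative values only) the resulting cache dir looks like
#   ~/.cache/jittor/jt1.3/g++9.4/py3.8.10/Linux-5.4.0-xab/IntelRCoreTMi7x3c/default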
cache_name = "default"
try:
if "cache_name" in os.environ:
@ -260,18 +324,14 @@ def find_cache_path():
for c in " (){}": cache_name = cache_name.replace(c, "_")
except:
pass
if os.environ.get("debug")=="1":
dirs[-1] += "_debug"
for name in os.path.normpath(cache_name).split(os.path.sep):
dirs.insert(-1, name)
dirs.append(name)
os.environ["cache_name"] = cache_name
LOG.v("cache_name: ", cache_name)
for d in dirs:
path = os.path.join(path, d)
if not os.path.isdir(path):
try:
os.mkdir(path)
except:
pass
assert os.path.isdir(path)
path = os.path.join(path, *dirs)
os.makedirs(path, exist_ok=True)
if path not in sys.path:
sys.path.append(path)
return path
@ -422,7 +482,7 @@ def get_total_mem():
is_in_ipynb = in_ipynb()
cc = None
LOG = LogWarper()
LOG = Logwrapper()
check_msvc_install = False
msvc_path = ""

View File

@ -0,0 +1,20 @@
★★★★★★★★★★★★★★★★★★★★★
Welcome to Jittor
Please put your files under the /root directory
★★★★★★★★★★★★★★★★★★★★★
The CUDA environment is already installed in this docker image.
Related links:
* [Jittor website](https://cg.cs.tsinghua.edu.cn/jittor/)
* [Jittor tutorials](https://cg.cs.tsinghua.edu.cn/jittor/tutorial/)
* [Jittor model zoo](https://cg.cs.tsinghua.edu.cn/jittor/resources/)
* [Jittor documentation](https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/index.html)
* [Github](https://github.com/jittor/jittor) [Gitee](https://gitee.com/jittor/jittor)
* [Jittor forum](https://discuss.jittor.org/)
* Instant messaging: QQ Group (761222083)
Stars, forks, and suggestions in the QQ group or on the forum are very welcome.
Note: please do not start a jupyter notebook or vscode server without password protection.
★★★★★★★★★★★★★★★★★★★★★

View File

@ -0,0 +1,16 @@
import sys
import os
command = sys.argv[1]
if (command == 'ssh'):
port = sys.argv[2]
data = open("/etc/ssh/sshd_config", "r").readlines()
data[12] = 'Port ' + port + '\nPermitRootLogin yes\n'
f = open("/etc/ssh/sshd_config", "w")
f.writelines(data)
f.close()
os.system("service ssh restart")
elif (command == 'passwd'):
passwd = sys.argv[2]
os.system("echo root:"+passwd+" | chpasswd")
else:
print('command error')

View File

@ -0,0 +1,137 @@
# ***************************************************************
# Copyright (c) 2021 Jittor. All Rights Reserved.
# Maintainers:
# Guoye Yang <498731903@qq.com>
# Dun Liang <randonlang@gmail.com>.
#
#
# This file is subject to the terms and conditions defined in
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
'''
example:
export class_home=/mnt/disk/cjld/class_nn
mkdir -p $class_home
docker pull jittor/jittor-cuda
python3.7 -m jittor_utils.class.setup_env setup 4
python3.7 -m jittor_utils.class.setup_env start 4
python3.7 -m jittor_utils.class.setup_env report
python3.7 -m jittor_utils.class.setup_env restart 4
python3.7 -m jittor_utils.class.setup_env stop
'''
# export class_home
# setup [n] // set up for n users, including building user paths, user_info.json and docker imgs. !!!WILL RESET STUDENT_FILES!!!
# start [n_gpu] // start one docker CONTAINER per user, distributed across n_gpu GPUs.
# stop // stop all user CONTAINERs.
# restart [n_gpu] // restart all user CONTAINERs, distributed across n_gpu GPUs.
import sys
import os
import json as js
import random
class_home = os.environ["class_home"]
student_files_dir = class_home + "/student_files"
student_files_bk_dir = class_home + "/student_files_bak"
cwd = os.path.dirname(__file__)
def run_cmd(cmd):
print("[CMD]:", cmd)
ret = os.system(cmd)
if ret:
print("[CMD] return", ret)
return ret
def generate_random_str(randomlength):
random_str = ''
base_str = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
length = len(base_str) - 1
for i in range(randomlength):
random_str += base_str[random.randint(0, length)]
return random_str
def setup(n):
if os.path.exists(student_files_dir):
if os.path.exists(student_files_bk_dir):
run_cmd(f"rm -rf {student_files_bk_dir}")
run_cmd(f"mv {student_files_dir} {student_files_bk_dir}")
os.makedirs(student_files_dir)
user_info = []
for i in range(n): # 0 for root
port = 20000 + i
passwd = generate_random_str(8)
name = 'stu_'+str(i)
path = os.path.abspath(os.path.join(student_files_dir, name))
info = {'port': port,
'passwd': passwd,
'name': name,
'path': path}
user_info.append(info)
student_files_src = class_home + "/student_files_src"
if os.path.isdir(student_files_src):
run_cmd(f"cp -r {student_files_src} {path}")
else:
run_cmd('mkdir -p ' + path)
js.dump(user_info, open(student_files_dir+"/user_info.json", "w"))
def start(n):
assert os.path.exists(student_files_dir+'/user_info.json')
user_info = js.load(open(student_files_dir+'/user_info.json', 'r'))
for i in range(len(user_info)):
id = i % n
u = user_info[i]
print('START', i, '/', len(user_info))
assert 0 == run_cmd(f'docker run -itd --shm-size=8g --network host --name {u["name"]} -v {u["path"]}:/root --gpus "device={id}" jittor/jittor-cuda bash')
# assert 0 == run_cmd(f'docker exec -it {u["name"]} bash -c \'apt update && apt install openssh-server -y\'')
assert 0 == run_cmd(f'docker cp {cwd}/setup.py {u["name"]}:/etc/ssh/setup.py')
assert 0 == run_cmd(f'docker cp {cwd}/motd {u["name"]}:/etc/motd')
assert 0 == run_cmd(f'docker exec -it {u["name"]} python3.7 /etc/ssh/setup.py passwd {u["passwd"]}')
assert 0 == run_cmd(f'docker exec -it {u["name"]} python3.7 /etc/ssh/setup.py ssh {u["port"]}')
assert 0 == run_cmd(f'docker exec -it {u["name"]} python3.7 -m pip install jittor -U')
assert 0 == run_cmd(f'docker exec -it {u["name"]} python3.7 -m jittor.test.test_core')
def stop():
assert os.path.exists(student_files_dir+'/user_info.json')
user_info = js.load(open(student_files_dir+'/user_info.json', 'r'))
for i in range(len(user_info)):
u = user_info[i]
print('STOP', i, '/', len(user_info))
run_cmd(f'docker rm -f {u["name"]}')
def report():
assert os.path.exists(student_files_dir+'/user_info.json')
user_info = js.load(open(student_files_dir+'/user_info.json', 'r'))
hostname = open("/etc/hostname", 'r').read().strip() + ".randonl.me"
for i in range(len(user_info)):
u = user_info[i]
print(f"ssh -p {u['port']} root@{hostname} # passwd: {u['passwd']}")
def restart(n):
stop()
start(n)
args = sys.argv[1:]
if (args[0] == 'setup'):
assert(len(args) == 2)
assert(type(eval(args[1])) == int)
n = int(args[1])
assert(n < 999)
setup(n)
elif (args[0] == 'start'):
assert(len(args) == 2)
assert(type(eval(args[1])) == int)
n = int(args[1])
start(n)
elif (args[0] == 'stop'):
stop()
elif (args[0] == 'restart'):
assert(len(args) == 2)
assert(type(eval(args[1])) == int)
n = int(args[1])
restart(n)
elif (args[0] == 'report'):
report()
else:
assert(False)

View File

@ -0,0 +1,55 @@
# ***************************************************************
# Copyright (c) 2021 Jittor. All Rights Reserved.
# Maintainers: Dun Liang <randonlang@gmail.com>.
# This file is subject to the terms and conditions defined in
# file 'LICENSE.txt', which is part of this source code package.
# ***************************************************************
import os, sys, shutil
from pathlib import Path
import glob
cache_path = os.path.join(str(Path.home()), ".cache", "jittor")
def callback(func, path, exc_info):
print(f"remove \"{path}\" failed.")
def rmtree(path):
if os.path.isdir(path):
print(f"remove \"{path}\" recursive.")
shutil.rmtree(path, onerror=callback)
def clean_all():
rmtree(cache_path)
def clean_core():
rmtree(cache_path+"/default")
rmtree(cache_path+"/master")
fs = glob.glob(cache_path+"/jt*")
for f in fs: rmtree(f)
def clean_cuda():
rmtree(cache_path+"/jtcuda")
rmtree(cache_path+"/cutt")
rmtree(cache_path+"/cub")
rmtree(cache_path+"/nccl")
def clean_dataset():
rmtree(cache_path+"/dataset")
def print_help():
msg = "|".join(keys)
print(f"Usage: {sys.executable} -m jittor_utils.clean_cache [{msg}]")
exit()
keys = [ k[6:] for k in globals() if k.startswith("clean_") ]
if __name__ == "__main__":
if len(sys.argv)==1:
print_help()
else:
for k in sys.argv[1:]:
if k not in keys:
print_help()
func = globals()["clean_"+k]
func()

View File

@ -57,6 +57,12 @@ def install_cuda():
if cuda_driver_version >= [11,4]:
cuda_tgz = "cuda11.4_cudnn8_win.zip"
md5 = "06eed370d0d44bb2cc57809343911187"
elif cuda_driver_version >= [11,2]:
cuda_tgz = "cuda11.2_cudnn8_win.zip"
md5 = "b5543822c21bc460c1a414af47754556"
elif cuda_driver_version >= [11,]:
cuda_tgz = "cuda11.0_cudnn8_win.zip"
md5 = "7a248df76ee5e79623236b0560f8d1fd"
elif cuda_driver_version >= [10,]:
cuda_tgz = "cuda10.2_cudnn7_win.zip"
md5 = "7dd9963833a91371299a2ba58779dd71"

View File

@ -2,9 +2,14 @@ try:
import fcntl
except ImportError:
fcntl = None
import win32file
import pywintypes
_OVERLAPPED = pywintypes.OVERLAPPED()
try:
import win32file
import pywintypes
_OVERLAPPED = pywintypes.OVERLAPPED()
except:
LOG.f("""pywin32 package not found, please install it.
If conda is used, please install with command:
>>> conda install pywin32""")
import os
from jittor_utils import cache_path, LOG

View File

@ -47,10 +47,17 @@ def download_url_to_local(url, filename, root_folder, md5):
return
else:
print('Downloading ' + url + ' to ' + file_path)
urllib.request.urlretrieve(
url, file_path,
reporthook=_progress()
)
try:
urllib.request.urlretrieve(
url, file_path,
reporthook=_progress()
)
except Exception as e:
msg = f"{e}\nDownload File failed, url: {url}, path: {file_path}"
print(msg)
if os.path.isfile(file_path):
os.remove(file_path)
raise RuntimeError(msg)
if not check_file_exist(file_path, md5):
raise RuntimeError("File downloads failed.")

View File

@ -0,0 +1,34 @@
import os
import glob
import shutil
import sys
home_path = os.path.join(os.path.dirname(__file__), "..", "..")
home_path = os.path.abspath(home_path)
def callback(func, path, exc_info):
print(f"remove \"{path}\" failed.")
def rmtree(path):
if os.path.isdir(path):
print(f"remove \"{path}\" recursive.")
shutil.rmtree(path, onerror=callback)
def remove_tmpfile():
dist_file = home_path+"/dist"
egg_file = glob.glob(home_path+"/**/*egg-info")
rmtree(dist_file)
for e in egg_file:
rmtree(e)
def run_cmd(cmd):
print("[CMD]", cmd)
assert os.system(cmd)==0
os.chdir(home_path)
remove_tmpfile()
run_cmd(f"{sys.executable} ./setup.py sdist")
run_cmd(f"{sys.executable} -m twine upload dist/*")
remove_tmpfile()

View File

@ -0,0 +1,25 @@
import ctypes
import os
if "CUDA_VISIBLE_DEVICES" in os.environ:
del os.environ["CUDA_VISIBLE_DEVICES"]
if os.name == 'nt':
cuda_driver = ctypes.CDLL("nvcuda")
else:
cuda_driver = ctypes.CDLL("libcuda.so")
driver_version = ctypes.c_int()
r = cuda_driver.cuDriverGetVersion(ctypes.byref(driver_version))
assert r == 0
v = driver_version.value
dcount = ctypes.c_int()
cuda_driver.cuInit(0)
r = cuda_driver.cuDeviceGetCount(ctypes.byref(dcount))
for i in range(dcount.value):
dev = ctypes.c_void_p()
major = ctypes.c_int()
minor = ctypes.c_int()
assert 0 == cuda_driver.cuDeviceGet(ctypes.byref(dev), i)
assert 0 == cuda_driver.cuDeviceGetAttribute(ctypes.byref(major), 75, dev)
assert 0 == cuda_driver.cuDeviceGetAttribute(ctypes.byref(minor), 76, dev)
print(major.value*10+minor.value)

View File

@ -58,7 +58,7 @@ setuptools.setup(
python_requires='>=3.7',
packages=["jittor", "jittor.test", "jittor.models", "jittor.utils", "jittor_utils"],
package_dir={'': os.path.join(path, 'python')},
package_dir={'': 'python'},
package_data={'': ['*', '*/*', '*/*/*','*/*/*/*','*/*/*/*/*','*/*/*/*/*/*']},
# include_package_data=True,
install_requires=[