add batch

Firmlyzhu 2019-05-26 18:43:36 +08:00
parent a3c7b2e315
commit 7074e0144c
18 changed files with 145 additions and 1 deletions


@ -14,6 +14,7 @@
* [Dashboard](portal/dashboard.md)
* [Config](portal/config.md)
* [Status](portal/status.md)
* [Batch](portal/batch.md)
* [Admin](portal/admin.md)
* [Hosts](portal/hosts.md)
* [Users](portal/users.md)
@ -39,3 +40,8 @@
* [What is 'Beans'](billing/beans.md)
* [Billing system](billing/billing.md)
* [How to get beans](billing/getBeans.md)
* [Batch Computing](batch/README.md)
* [Concepts](batch/concepts.md)
* [Batch Jobs Creation and Configuration](batch/create.md)
* [Job Statuses and Scheduling](batch/status_schedule.md)
* [Billing](batch/billing.md)


@ -0,0 +1,12 @@
# Batch Computing #
BatchCompute is a distributed cloud service for parallel batch jobs. It supports batch jobs whose tasks form a directed acyclic graph (DAG). The batch computing system automatically handles resource management, job scheduling, and data loading, and charges according to actual usage.
The entry point of the system is **batch**, on the left side of the dashboard. The batch index page is introduced in [Batch](portal/batch.md).
This page introduces the batch computing system of Docklet in four parts:
* [Concepts](concepts.md)
* [Batch jobs creation and configuration](create.md)
* [Job statuses and scheduling](status_schedule.md)
* [Billing](billing.md)


@ -0,0 +1,22 @@
# Billing #
On the **Info** page of each job, you can view the billing for each task, as shown below:
<img src='../images/batch_billing.jpg'>
Billing is performed when each task reaches a terminal state (failed, finished, or stopped), at which point the corresponding number of "beans" is deducted from the user's balance. The total cost of a job is the sum of the costs of its tasks.
For each running instance of a task on a single vnode, the billing formula is:
<img src='../images/batch_formula.png'>
Here, B is the number of beans spent; Ceil denotes rounding up to the nearest integer; Ncpu is the number of CPU cores configured for the task; Nmem and Ndisk are the configured memory size and disk size (in GB), respectively; Ngpu is the number of configured GPUs; and T is the total running time of the task (in seconds). Pcpu, Pmem, Pdisk, and Pgpu are the unit prices of the respective resources.
In the current version, the prices are:
* Pcpu = 1/3600 /(core*s)
* Pmem = 1/3600 /(GB*s)
* Pdisk = 1/3600 /(GB*s)
* Pgpu = 100/3600 /(num*s)
The cost of a single task is the sum of the costs of its running instances across all of its vnodes.
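The billing formula itself is only available as an image, so the following is a minimal sketch under an explicit assumption: that B = Ceil(T × (Ncpu·Pcpu + Nmem·Pmem + Ndisk·Pdisk + Ngpu·Pgpu)). The function names `vnode_cost`, `task_cost`, and `job_cost` are hypothetical and are not part of Docklet. Exact fractions are used so that rounding up is not affected by floating-point error.

```python
from fractions import Fraction
from math import ceil

# Unit prices from the list above, in beans per resource unit per second.
PRICES = {
    "cpu": Fraction(1, 3600),     # per core
    "mem": Fraction(1, 3600),     # per GB
    "disk": Fraction(1, 3600),    # per GB
    "gpu": Fraction(100, 3600),   # per GPU
}


def vnode_cost(ncpu, nmem_gb, ndisk_gb, ngpu, seconds):
    """Beans charged for one running instance on one vnode.

    ASSUMPTION: B = Ceil(T * (Ncpu*Pcpu + Nmem*Pmem + Ndisk*Pdisk + Ngpu*Pgpu)).
    The authoritative formula is the one shown in the image and may differ.
    """
    rate = (ncpu * PRICES["cpu"] + nmem_gb * PRICES["mem"]
            + ndisk_gb * PRICES["disk"] + ngpu * PRICES["gpu"])
    return ceil(rate * seconds)


def task_cost(instances):
    """Sum the cost of every running instance (one tuple per vnode) of a task."""
    return sum(vnode_cost(*instance) for instance in instances)


def job_cost(tasks):
    """Sum the cost of every task in a job."""
    return sum(task_cost(instances) for instances in tasks)
```

Under this assumption, a vnode with 2 cores, 4 GB of memory, 10 GB of disk, and no GPU that runs for one hour costs `vnode_cost(2, 4, 10, 0, 3600)` beans, which evaluates to 16.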


@ -0,0 +1,7 @@
# Concepts #
* Job: A job is the basic unit of system management and the basic unit that users create and delete. A job can contain multiple tasks.
* Task: A task is the basic unit of system scheduling. Queuing and execution are performed task by task. A task runs on a virtual cluster (vcluster), whose nodes are called virtual nodes (vnodes).
* Vnode: A task can run on multiple vnodes. Each vnode corresponds to a container runtime that encapsulates the runtime environment and resources. All vnodes of a task have the same resource configuration and image.
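The relationship between the three concepts can be sketched as a small data model. This is an illustration only: the class and field names below are hypothetical and do not reflect Docklet's internal code. The point it shows is that a job owns tasks, and every vnode of a task is built from the same task-level configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Vnode:
    index: int         # position of the vnode within its task
    image: str         # same image for every vnode of a task
    cpu: int           # same resource configuration for every vnode
    memory_gb: float


@dataclass
class Task:
    task_id: str
    image: str
    cpu: int
    memory_gb: float
    vnode_count: int = 1

    def vnodes(self) -> List[Vnode]:
        # Every vnode copies the task-level image and resources.
        return [Vnode(i, self.image, self.cpu, self.memory_gb)
                for i in range(self.vnode_count)]


@dataclass
class Job:
    name: str
    tasks: List[Task] = field(default_factory=list)


job = Job("demo", [Task("1", "ubuntu", cpu=2, memory_gb=4.0, vnode_count=3)])
vnodes = job.tasks[0].vnodes()
```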


@ -0,0 +1,42 @@
# Batch Jobs Creation and Configuration #
On the job creation page, the first three fields are the job name, the cluster location to which the job is submitted, and the priority:
<img src='../images/batch_create1.jpg'>
Note:
* The name does not need to be unique. Choose a name that describes the job.
* The cluster location is the cluster where the job is actually executed. Images and data cannot be shared between clusters.
* The priority is the scheduling priority: 0 is the lowest and 9 is the highest. The higher the priority, the more likely the job is to be executed.
Next comes the configuration form for each task:
<img src='../images/batch_create2.jpg'>
In each task form, click **"x"** in the upper right to delete the task, and click **Confirm** in the bottom right to fold the panel.
Click **Add Task** at the bottom to add a new task. There is only one task by default.
The first three settings of a task are Running Path, Command, and Image.
To see the remaining settings, click **Show detailed options**:
<img src='../images/batch_create3.jpg'>
Note:
* CPU: The number of CPU cores used by each vnode.
* GPU: The number of GPUs used by each vnode.
* Memory: The amount of memory used by each vnode.
* Disk: The disk size used by each vnode.
* Vnode Number: The number of vnodes that the task runs on.
* Max Retry Times: The maximum number of times the task is retried on each vnode after it encounters an error.
* Expire Time: The timeout for the task on each vnode. A task that exceeds this time is killed by the system.
* Dependency: The tasks that this task depends on. Fill in their task numbers, separated by commas, for example: 1, 2. The task number appears after "**#**" in each task title. This task is executed only after the tasks it depends on have completed.
* Stderr/Stdout Redirect Path: The redirection paths for stderr and stdout. If a path ends with "/", the output is written to a file named "{taskid}_{vnodeid}_stdout.txt" or "{taskid}_{vnodeid}_stderr.txt" in that folder; otherwise it is written to the named file. The path must already exist.
* Run on: "All vnodes" runs the same command on each vnode, and "One vnode(master)" runs only on the node with the host name batch-0.
* Start at the Same Time: Whether the tasks on each vnode need to be started at the same time.
* Object Storage Mapping: The system supports object storage services from external cloud providers and can mount a bucket into the vnode. This requires filling in the form. Currently only Aliyun is supported.
**Other default configuration:** All vnodes of a task are on the same local area network, and their host names are batch-0, batch-1, and so on, in order of startup. In addition, passwordless ssh login is enabled on every vnode, and every vnode can access the external network.
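The redirect path rule described above can be illustrated with a short sketch. The function name `redirect_target` is hypothetical, and the sketch assumes that the shorthand "{taskid}_{vnodeid}_stdout/stderr.txt" stands for two file names, one ending in `_stdout.txt` and one ending in `_stderr.txt`.

```python
def redirect_target(path, task_id, vnode_id, stream):
    """Resolve where one output stream of one vnode would be written.

    ASSUMPTION: a path ending with "/" is a folder, and the generated file
    is named "{taskid}_{vnodeid}_{stream}.txt". Any other path is used as is.
    """
    if stream not in ("stdout", "stderr"):
        raise ValueError("stream must be 'stdout' or 'stderr'")
    if path.endswith("/"):
        return f"{path}{task_id}_{vnode_id}_{stream}.txt"
    return path


# Folder path: a file name is generated per task, vnode, and stream.
folder_case = redirect_target("/root/logs/", 1, 0, "stdout")
# File path: the named file is used directly.
file_case = redirect_target("/root/out.txt", 1, 0, "stderr")
```

Note that with a file path, every vnode of the task resolves to the same file, so a folder path is the safer choice when a task runs on more than one vnode.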


@ -0,0 +1,26 @@
# Job Statuses and Scheduling #
The statuses of the tasks in the job can be viewed through the **Info** button on the Batch index page:
<img src='../images/batch_status.jpg'>
A task can be in one of the following statuses:
* pending: The task has not yet entered the scheduling queue.
* scheduling: The task is in the scheduling queue and waiting for scheduling.
* running: The task is being executed.
* retrying: An error has occurred and the task is about to be retried.
* failed: The task has failed and will not be executed again.
* finished: The task has completed successfully.
* stopped: The task was stopped manually.
The status of a job is determined by the statuses of its tasks, as follows:
* failed: At least one task failed.
* finished: All tasks are finished.
* stopping: The job is stopping.
* stopped: The job has been stopped manually.
* running: At least one task is being executed.
* pending: All other cases.
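As an illustration, the job status rules can be written as a small function. This is a sketch, not Docklet's implementation: the name `job_status` and the `stop_state` parameter are hypothetical, and the rules are assumed to be checked in the order in which they are listed, which the text does not state explicitly. Because "stopping" and "stopped" result from a user action rather than from task statuses alone, they are passed in separately.

```python
def job_status(task_statuses, stop_state=None):
    """Derive a job status from the statuses of its tasks.

    ASSUMPTION: rules are evaluated in the listed order (failed, finished,
    stopping/stopped, running, pending). stop_state is None, "stopping",
    or "stopped" and represents a manual stop request.
    """
    if any(s == "failed" for s in task_statuses):
        return "failed"
    if task_statuses and all(s == "finished" for s in task_statuses):
        return "finished"
    if stop_state in ("stopping", "stopped"):
        return stop_state
    if any(s == "running" for s in task_statuses):
        return "running"
    return "pending"
```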
In Docklet, a task is the basic unit of scheduling and execution. Tasks without dependencies go directly to the scheduling queue; other tasks do not enter the queue until all the tasks they depend on have completed.
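The dependency rule can be sketched as follows. The function `scheduling_waves` is hypothetical and simplifies the real scheduler: it assumes that every task in a wave completes before the next wave is computed, and it ignores priorities and resource limits. It only shows the order in which tasks become eligible for the scheduling queue.

```python
def scheduling_waves(dependencies):
    """Group tasks into waves in which they can enter the scheduling queue.

    dependencies maps a task number to the list of task numbers it depends on.
    ASSUMPTION: all tasks in a wave complete before the next wave is computed.
    """
    completed = set()
    waves = []
    while len(completed) < len(dependencies):
        # A task is eligible once all of its dependencies have completed.
        wave = sorted(task for task, deps in dependencies.items()
                      if task not in completed and set(deps) <= completed)
        if not wave:
            raise ValueError("dependencies contain a cycle or an unknown task")
        waves.append(wave)
        completed.update(wave)
    return waves


# Tasks 1 and 2 have no dependencies, task 3 depends on both, task 4 on task 3.
waves = scheduling_waves({1: [], 2: [], 3: [1, 2], 4: [3]})
```

For the example above, tasks 1 and 2 enter the queue first, then task 3, then task 4. The error branch reflects that the dependencies must form a directed acyclic graph.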

Binary files not shown: 10 images (78 KiB, 17 KiB, 55 KiB, 47 KiB, 44 KiB, 14 KiB, 94 KiB, 100 KiB, 31 KiB, 74 KiB).


@ -0,0 +1,29 @@
### Batch ###
BatchCompute is a distributed cloud service for parallel batch jobs. It supports batch jobs whose tasks form a directed acyclic graph (DAG). The batch computing system automatically handles resource management, job scheduling, and data loading, and charges according to actual usage.
The batch page shows the main information for all batch jobs submitted by the user:
<img src="../images/batch_index.jpg" width="600" alt="batch jobs info">
This page mainly supports the following operations:
1. Click **Info** to see the detailed information of a batch job, including the statuses and detailed configurations of its tasks, as shown below:
<img src="../images/batch_info.jpg" width="600" alt="detailed info of a batch job">
2. Click **Get Output** to fetch the stderr and stdout output of all the tasks running on each vnode:
<img src="../images/batch_output.jpg" width="600" alt="output toggle for a job">
Click one of the buttons to show the stdout or stderr output of the task running on that vnode:
<img src="../images/batch_detail.jpg" width="600" alt="detailed output for a job">
**Note:** The page is updated automatically every 2 seconds and shows only the last 100 lines of output.
3. Click **stop** to stop running jobs. If you find that a job is misconfigured, stop it promptly to save resources.
4. Click **Create Batch Job** to create a new batch job. For more information, see [Batch jobs creation and configuration](../batch/create.md).
For more about batch computing, see [Batch Computing](../batch/README.md).


@ -4,4 +4,4 @@
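* Task: A task is the basic unit of system scheduling. Queuing and execution are performed task by task. A task runs on a virtual cluster (vcluster), whose nodes are called virtual nodes (vnodes).
* Vnode: A task can run on multiple vnodes. Each vnode corresponds to a container runtime that encapsulates the runtime environment and resources. All vnodes of a task have the same resource configuration and image.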