MIT License
Copyright (c) 2018 Jeremy Auguste
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
This repository groups a set of tools that make OAR easier to use.
# Requirements
OAR itself must be installed on the system for any of these commands to work.
Here are the requirements for each script:
- `oargen` requires Python 2 or 3. It uses the `oarsub` command from OAR.
- `oarstats` requires Python 2 or 3 and the *pyyaml* package. It uses the `oarstat` command from OAR.
- `batchoar` requires Python 3 and the *pyyaml* package. It uses the `oarsub` command from OAR.
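The *pyyaml* dependency can be installed with `pip install pyyaml`.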
# Usage
## OARGEN
The basic usage of `oargen` is simply `python oargen.py -r command` where *command* is the job
to execute on the cluster.
Several options can be used:
- --run: actually submit the job; without this flag, `oargen` only prints the `oarsub` command it would run.
- --time *TIME*: reservation walltime of your job. Accepted formats: *h*, *h:m* or *h:m:s*. Defaults to 10 hours.
- --core *NB\_CORES*: number of cores required by your job. Defaults to 1.
- --interactive: launch the job in interactive mode instead of passive mode.
- --gpu: request GPUs.
- --host *host1* [ *host2* ... *hostn* ]: names of the hosts the job is allowed to run on (SQL 'LIKE' patterns accepted).
- --ignore-host *host1* [ *host2* ... *hostn* ]: names of the hosts the job must not run on (SQL 'NOT LIKE' patterns accepted).
- --anterior *ANTERIOR\_ID*: the job will only start once the job with the specified ID has finished.
- --checkpoint *delay*: enable the checkpoint signal with the given delay (in seconds).
- --name *name*: name of the job.
- --directory *directory*: directory in which the standard output and standard error logs will be stored (it must already exist when --run is used).
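
For example, `python oargen.py --run --time 2:30 --core 4 --gpu --name myjob --directory logs ./train.sh` (with `./train.sh` standing in for any job command) requests 4 cores on a node with a GPU for a walltime of 2h30 and stores the logs under `logs/`.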
## OARSTATS
The basic usage of `oarstats` is simply `python oarstats.py`.
For each user, it prints the number of jobs, cores and GPUs in use, broken down
by queue, together with the user's karma and total reserved time.
One option can be used:
- --show-hosts: Adds an extra line for each user indicating which hosts are being used.
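
For example, `python oarstats.py --show-hosts` appends to each user a line such as `Running on: node1:3 node2:1` (host names illustrative), i.e. one `host:jobs` pair per assigned host.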
#!/usr/bin/env python
# coding: utf-8

import argparse
import yaml
import itertools
import logging
from collections import deque
import os

import oargen


def argparser():
    parser = argparse.ArgumentParser()
    parser.add_argument('job_file',
                        help="YAML file which contains all the jobs to launch")
    parser.add_argument('-b', '--besteffort', action="store_true",
                        help="Launch job in besteffort mode")
    parser.add_argument('-t', '--time', default="10",
                        help="Default maximum duration of the jobs (format: h[:m[:s]])")
    parser.add_argument('-g', '--gpu', action="store_true",
                        help="If True, reserves only cores with GPUs")
    parser.add_argument('-c', '--core', default=1, type=int,
                        help="Number of cores to reserve")
    parser.add_argument('-H', '--host', nargs="+", default=[],
                        help="Name of the hosts (SQL 'LIKE' syntax accepted)")
    parser.add_argument('-I', '--ignore-host', nargs="+", default=[],
                        help="Name of the hosts to ignore (SQL 'NOT LIKE' syntax accepted)")
    parser.add_argument('-i', '--interactive', action="store_true",
                        help="Launch job in interactive mode")
    parser.add_argument('-C', '--checkpoint', type=int, metavar="SECONDS",
                        help="Enable checkpoint signals with the given delay (in seconds)")
    parser.add_argument('-j', '--max-jobs', type=int, default=-1)
    parser.add_argument('--print-commands', action="store_true",
                        help="Print each individual command")
    parser.add_argument('-r', '--run', action="store_true",
                        help="Run the command")
    parser.add_argument('-d', '--directory',
                        help="Creates/specifies a directory and stores oarsub outputs in it.")
    parser.add_argument('-l', '--logger', default='INFO',
                        choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
                        help="Logging level: DEBUG, INFO (default), WARNING, ERROR")
    args = parser.parse_args()
    numeric_level = getattr(logging, args.logger.upper(), None)
    if not isinstance(numeric_level, int):
        raise ValueError("Invalid log level: {}".format(args.logger))
    logging.basicConfig(level=numeric_level)
    return args


def load_variables(yaml_variables):
    # Build the cartesian product of all variable values: one tuple per job.
    variable_names = []
    variables = []
    for name, values in yaml_variables.items():
        variable_names.append(name)
        variables.append(values)
    return variable_names, list(itertools.product(*variables))


def load_flags(yaml_flags, args):
    # Map each command flag to itself when the matching batchoar option is
    # set, and to the empty string otherwise.
    flags = {}
    for command_flag, oar_flag in yaml_flags.items():
        if getattr(args, oar_flag):
            flags[command_flag] = command_flag
        else:
            flags[command_flag] = ""
    return flags


def load_job(job_name, job, var_names, var_tuples, constants, flags, precommand):
    # Substitute $constant$, @flag@ and %variable% placeholders in both the
    # job command and its name, producing one job per variable combination.
    for name, value in constants.items():
        job = job.replace("${}$".format(name), str(value))
        job_name = job_name.replace("${}$".format(name), str(value))
    for flag, str_flag in flags.items():
        job = job.replace("@{}@".format(flag), str_flag)
        job_name = job_name.replace("@{}@".format(flag), str_flag)
    if precommand:
        job = "{} {}".format(precommand, job)
    jobs = []
    for var_tuple in var_tuples:
        new_job = job
        new_job_name = job_name
        for var_name, var in zip(var_names, var_tuple):
            new_job = new_job.replace("%{}%".format(var_name), str(var))
            new_job_name = new_job_name.replace("%{}%".format(var_name), str(var))
        jobs.append((new_job_name, new_job))
    return jobs


def load_jobs(yaml_in, args):
    yaml_jobs = yaml.safe_load(yaml_in)  # safe_load: the job file is plain data
    var_names, var_tuples = load_variables(yaml_jobs["variables"])
    constants = yaml_jobs["constants"] if "constants" in yaml_jobs else {}
    flags = load_flags(yaml_jobs["flags"], args)
    try:
        precommand = yaml_jobs["environment"]["precommand"]
    except KeyError:
        precommand = ""
    jobs = []
    for job_name, job in yaml_jobs["jobs"].items():
        jobs.extend(load_job(job_name, job, var_names, var_tuples, constants, flags, precommand))
    return jobs


def extract_job_id(cmd_output):
    for line in cmd_output.splitlines():
        if line.startswith("OAR_JOB_ID="):
            return line.split('=')[1]
    return None


def create_directory(directory, fake_run=False):
    if directory is None or fake_run:
        return
    if os.path.isdir(directory):
        return
    os.mkdir(directory)


def main():
    args = argparser()
    with open(args.job_file) as fin:
        jobs = load_jobs(fin, args)
    anteriors = deque()
    fake_id_counter = 0
    create_directory(args.directory, fake_run=not args.run)
    for job_name, job in jobs:
        # Once --max-jobs submissions are pending, chain each new job behind
        # the oldest one so that at most max_jobs jobs run at the same time.
        if args.max_jobs < 0 or len(anteriors) < args.max_jobs:
            anterior = None
        else:
            anterior = anteriors.pop()
        oar_command = oargen.prepare_oarsub(args.gpu, args.host, args.core, args.time,
                                            ignore_hosts=args.ignore_host,
                                            command=job, name=job_name,
                                            output_directory=args.directory,
                                            besteffort=args.besteffort,
                                            checkpoint=args.checkpoint, anterior=anterior)
        cmd_output = oargen.run_oarsub(oar_command, print_cmd=args.print_commands,
                                       fake_run=not args.run, return_output=True)
        if cmd_output is not None:
            job_id = extract_job_id(cmd_output)
            if job_id is None:
                logging.warning("Job '{}' seems to have failed to launch...".format(job_name))
            else:
                anteriors.appendleft(job_id)
                logging.debug("OAR_JOB_ID={}".format(job_id))
                if anterior is not None:
                    logging.debug("ANTERIOR_JOB_ID={}".format(anterior))
        elif args.run:
            logging.warning("Job '{}' didn't return anything...".format(job_name))
        else:
            fake_id = "FAKE_ID_{}".format(fake_id_counter)
            fake_id_counter += 1
            anteriors.appendleft(fake_id)
    logging.info("Scheduled {} jobs...".format(len(jobs)))


if __name__ == '__main__':
    main()
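
The sample job file below shows the YAML layout `batchoar` expects: entries under `constants` replace `$name$` placeholders, each combination of the lists under `variables` yields one job with its `%name%` placeholders filled in, `flags` maps a command flag to the `batchoar` option that enables it (here `--cuda` is inserted wherever `@--cuda@` appears when `--gpu` is passed, and removed otherwise), and `environment.precommand` is prepended to every command.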
constants:
  dev: corpora/dev_c1200.delim.feats.daadless.hdf5
  train: corpora/dev_c1200.delim.feats.daadless.hdf5
variables:
  hidden: [64, 128, 256]
  optimizer: ["Adam", "Adadelta"]
flags:
  --cuda: gpu
environment:
  precommand: pylauncher pytorch3
jobs:
  train.rnn.attn.%optimizer%.c1200_h%hidden%.ws.conseil: python train.py @--cuda@ --bidirectional --optimizer %optimizer% --conv-dim %hidden% --embeddings-dim 100 3 --input-datanames x_word x_speaker --output-datanames y_conseil -e 7 --checkpoint checkpoints/c1200_h%hidden%.ws.%optimizer%.daadless.conseil.mdl --valid $dev$ $train$ models/c1200_h%hidden%.ws.%optimizer%.daadless.conseil.mdl
  train.rnn.attn.%optimizer%.c1200_h%hidden%.ws.nps_3class: python train.py @--cuda@ --bidirectional --optimizer %optimizer% --conv-dim %hidden% --embeddings-dim 100 3 --input-datanames x_word x_speaker --output-datanames y_3class_nps -e 7 --checkpoint checkpoints/c1200_h%hidden%.ws.%optimizer%.daadless.conseil.nps_3class.mdl --valid $dev$ $train$ models/c1200_h%hidden%.ws.%optimizer%.daadless.nps_3class.mdl
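
To make the expansion concrete, here is a minimal, self-contained sketch of the same substitution rules that `load_job` applies, run on a shortened version of this file (all paths and values below are illustrative):

```python
# Sketch of batchoar's placeholder expansion; names and paths are illustrative.
import itertools

constants = {"dev": "corpora/dev.hdf5"}
variables = {"hidden": [64, 128], "optimizer": ["Adam", "Adadelta"]}
flags = {"--cuda": "--cuda"}  # as produced by load_flags when --gpu is set
template = ("python train.py @--cuda@ --optimizer %optimizer% "
            "--conv-dim %hidden% --valid $dev$")

job = template
for name, value in constants.items():        # $name$ -> constant value
    job = job.replace("${}$".format(name), str(value))
for flag, str_flag in flags.items():         # @flag@ -> flag or ""
    job = job.replace("@{}@".format(flag), str_flag)
for combo in itertools.product(*variables.values()):
    expanded = job
    for name, value in zip(variables.keys(), combo):  # %name% -> one value
        expanded = expanded.replace("%{}%".format(name), str(value))
    print(expanded)
# First line printed:
#   python train.py --cuda --optimizer Adam --conv-dim 64 --valid corpora/dev.hdf5
# ... one command per (hidden, optimizer) combination (4 in total here).
```

The job names go through the same substitutions, so the log files produced via `-d` stay distinguishable across combinations.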
from __future__ import print_function
from __future__ import unicode_literals

import os
import argparse
import subprocess


def argparser():
    parser = argparse.ArgumentParser()
    parser.add_argument('command', nargs=argparse.REMAINDER,
                        help="Command to use for the job (in passive mode)")
    parser.add_argument('-n', '--name',
                        help="Name to give to the job")
    parser.add_argument('-d', '--directory',
                        help="Directory in which the oarsub outputs will be stored")
    parser.add_argument('-b', '--besteffort', action="store_true",
                        help="Launch job in besteffort mode")
    parser.add_argument('-t', '--time', default="10",
                        help="Estimated maximum duration of the job (format: h[:m[:s]]) (default: %(default)s)")
    parser.add_argument('-g', '--gpu', action="store_true",
                        help="If True, reserves only cores with GPUs")
    parser.add_argument('-c', '--core', default=1, type=int,
                        help="Number of cores to reserve. (default: %(default)s)")
    parser.add_argument('-H', '--host', nargs="+", default=[],
                        help="Name of the hosts (SQL 'LIKE' syntax accepted)")
    parser.add_argument('-I', '--ignore-host', nargs="+", default=[],
                        help="Name of the hosts to ignore (SQL 'NOT LIKE' syntax accepted)")
    parser.add_argument('-i', '--interactive', action="store_true",
                        help="Launch job in interactive mode")
    parser.add_argument('-C', '--checkpoint', type=int, metavar="SECONDS",
                        help="Enable checkpoint signals with the given delay (in seconds)")
    parser.add_argument('-r', '--run', action="store_true",
                        help="Run the command")
    parser.add_argument('-a', '--anterior',
                        help="Anterior job id that must be terminated to start this new one")
    return parser.parse_args()


def prepare_oarsub(gpu, hosts, core, time,
                   ignore_hosts=[],
                   command=None,
                   interactive=False,
                   name=None, output_directory=None,
                   besteffort=False,
                   checkpoint=None, anterior=None):
    """Build the oarsub command line as a list of arguments."""
    oar_cmd = ["oarsub"]
    # -p: resource properties (GPU and host constraints).
    oar_cmd.append("-p")
    properties = ""
    if gpu:
        properties += "(gpu IS NOT NULL)"
    else:
        properties += "(gpu IS NULL)"
    if hosts:
        properties += " AND ("
        for idx, host in enumerate(hosts):
            if idx != 0:
                properties += " OR "
            properties += "host LIKE '{}'".format(host)
        properties += ")"
    for host in ignore_hosts:
        properties += " AND host NOT LIKE '{}'".format(host)
    oar_cmd.append(properties)
    # -l: requested resources (number of cores and walltime).
    oar_cmd.append("-l")
    time = time.split(':')
    hour = time[0]
    minutes = time[1] if len(time) >= 2 else "00"
    seconds = time[2] if len(time) >= 3 else "00"
    resources = "core={},walltime={}:{}:{}".format(core, hour, minutes, seconds)
    oar_cmd.append(resources)
    if name is not None:
        oar_cmd.extend(["-n", name])
        directory = output_directory + "/" if output_directory is not None else ""
        oar_cmd.extend(["-O", "{}{}.%jobid%.stdout".format(directory, name)])
        oar_cmd.extend(["-E", "{}{}.%jobid%.stderr".format(directory, name)])
    if besteffort:
        oar_cmd.extend(["-t", "besteffort", "-t", "idempotent"])
    if checkpoint is not None:
        # checkpoint is parsed as an int; oarsub arguments must be strings.
        oar_cmd.extend(["--checkpoint", str(checkpoint)])
    if anterior is not None:
        oar_cmd.extend(["-a", anterior])
    if interactive:
        oar_cmd.append('-I')
    else:
        job_command = command if isinstance(command, str) else " ".join(command)
        oar_cmd.append(job_command)
    return oar_cmd


def run_oarsub(command, print_cmd=False, fake_run=False, return_output=False):
    """Print and/or execute the oarsub command built by prepare_oarsub."""
    if print_cmd:
        print(subprocess.list2cmdline(command))
    if fake_run:
        return None
    if not return_output:
        subprocess.call(command)
        return None
    return subprocess.check_output(command).decode("utf8")


def main():
    args = argparser()
    if args.directory is not None and not os.path.isdir(args.directory) and args.run:
        raise RuntimeError("'{}' is not a directory!".format(args.directory))
    oar_command = prepare_oarsub(args.gpu, args.host, args.core, args.time,
                                 ignore_hosts=args.ignore_host,
                                 command=args.command,
                                 interactive=args.interactive,
                                 name=args.name, output_directory=args.directory,
                                 besteffort=args.besteffort, checkpoint=args.checkpoint,
                                 anterior=args.anterior)
    run_oarsub(oar_command, print_cmd=True, fake_run=not args.run)


if __name__ == '__main__':
    main()
#!/usr/bin/env python
# coding: utf-8

from __future__ import print_function
from __future__ import unicode_literals

import argparse
import logging
import sys
try:
    import yaml
except ImportError:
    print("Package 'pyyaml' needs to be installed to be able to use this script!", file=sys.stderr)
    exit(1)
import subprocess
from collections import defaultdict
import re
import time
import datetime


class Owner:
    """Aggregated job statistics for a single user."""

    def __init__(self, name):
        self.name = name
        self.queues = defaultdict(list)
        self.karma = defaultdict(float)
        self.timeleft = 0
        self.running = defaultdict(int)
        self.running_cores = defaultdict(int)
        self.running_gpus = defaultdict(int)
        self.resources = defaultdict(int)
        self.gpu = defaultdict(int)
        self.devices = defaultdict(int)

    def add_job(self, job):
        self.queues[job.queue].append(job)
        if job.karma > self.karma[job.queue]:
            self.karma[job.queue] = job.karma
        self.timeleft += job.wall_time - job.elapsed_time
        self.resources[job.queue] += job.resources
        if job.elapsed_time != 0:  # a job is running once it has a start time
            self.running[job.queue] += 1
            self.running_cores[job.queue] += job.resources
            if job.gpu:
                self.running_gpus[job.queue] += job.resources
        if job.gpu:
            self.gpu[job.queue] += job.resources
        for device in job.devices:
            self.devices[device] += 1

    def print_info(self, show_devices=False):
        print("User {} :: Total Time Reserved: {}".format(
            self.name, datetime.timedelta(seconds=self.timeleft)))
        for queue in self.queues.keys():
            print("\t{} - Running: {} jobs ({} cores, {} gpus), "
                  "Total: {} jobs ({} cores, {} gpus) - Karma: {}".format(
                      queue,
                      self.running[queue],
                      self.running_cores[queue],
                      self.running_gpus[queue],
                      len(self.queues[queue]),
                      self.resources[queue],
                      self.gpu[queue],
                      self.karma[queue]))
        if show_devices and self.devices:
            print("Running on: {}".format(
                " ".join(["{}:{}".format(device, amount)
                          for device, amount in self.devices.items()])))


class Job:
    """A single OAR job, as parsed from the oarstat output."""

    def __init__(self, job_id, elapsed_time, wall_time, resources, devices, gpu, queue, karma):
        self.job_id = job_id
        self.elapsed_time = elapsed_time
        self.wall_time = wall_time
        self.resources = resources
        self.devices = devices
        self.gpu = gpu
        self.queue = queue
        self.karma = karma


def argparser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--show-hosts', action="store_true")
    parser.add_argument('-l', '--logger', default='INFO',
                        choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
                        help="Logging level: DEBUG, INFO (default), WARNING, ERROR")
    args = parser.parse_args()
    numeric_level = getattr(logging, args.logger.upper(), None)
    if not isinstance(numeric_level, int):
        raise ValueError("Invalid log level: {}".format(args.logger))
    logging.basicConfig(level=numeric_level)
    return args


def main():
    args = argparser()
    stats_output = subprocess.check_output(["oarstat", "--yaml"]).decode('utf-8')
    stats_yaml = yaml.safe_load(stats_output)
    owners = {}
    # Job attributes are parsed out of oarstat's free-form "message" field.
    resources_pattern = re.compile(r'R=([0-9]+)')
    walltime_pattern = re.compile(r'W=([0-9]+(:[0-9]+(:[0-9]+)?)?)')
    queue_pattern = re.compile(r'Q=(\S+)')
    karma_pattern = re.compile(r'Karma=([0-9]+\.[0-9]+)')
    gpu_pattern = re.compile('gpu is not null', flags=re.IGNORECASE)
    for job_id, job_info in stats_yaml.items():
        if job_info["owner"] not in owners:
            owners[job_info["owner"]] = Owner(job_info["owner"])
        elapsed_time = 0 if job_info["startTime"] == 0 else time.time() - job_info["startTime"]
        if job_info["message"] == '':
            continue
        # The walltime may contain only hours, or hours:minutes.
        tokens = re.search(walltime_pattern, job_info["message"]).group(1).split(':')
        wall_time = int(tokens[0]) * 3600
        if len(tokens) > 1:
            wall_time += int(tokens[1]) * 60
        if len(tokens) > 2:
            wall_time += int(tokens[2])
        resources = int(re.search(resources_pattern, job_info["message"]).group(1))
        try:
            queue = re.search(queue_pattern, job_info["message"]).group(1)
        except AttributeError:
            queue = "besteffort"
        try:
            karma = float(re.search(karma_pattern, job_info["message"]).group(1))
        except AttributeError:
            karma = 0.0
        devices = job_info["assigned_network_address"]
        gpu = re.search(gpu_pattern, job_info["properties"]) is not None
        job = Job(job_id, elapsed_time, wall_time, resources, devices, gpu, queue, karma)
        owners[job_info["owner"]].add_job(job)
    for owner in owners.values():
        owner.print_info(show_devices=args.show_hosts)
        print()


if __name__ == '__main__':
    main()
#! /bin/zsh

if [[ $# -lt 2 ]]; then
    echo "Usage: $0 environment_name [OPTIONS] command [arg1..argn]" >&2
    echo "OPTIONS:" >&2
    echo "  -h, --help           Show this message" >&2
    echo "  -p, --cuda-path      Specify the path where cuda is installed (default: /usr/local/cuda-10.1)" >&2
    echo "  -a, --anaconda-path  Specify the path where anaconda is installed (default: /storage/raid1/homedirs/jeremy.auguste/.anaconda3)" >&2
    exit 1
fi

cuda_path="/usr/local/cuda-10.1"
anaconda_path="/storage/raid1/homedirs/jeremy.auguste/.anaconda3"

while [[ $1 == -* ]]; do
    case "$1" in
        -p|--cuda-path)
            cuda_path="$2"
            shift 2
            ;;
        -a|--anaconda-path)
            anaconda_path="$2"
            shift 2
            ;;
        -h|--help)
            echo "Usage: $0 environment_name [OPTIONS] command [arg1..argn]"
            echo "OPTIONS:"
            echo "  -h, --help           Show this message"
            echo "  -p, --cuda-path      Specify the path where cuda is installed (default: /usr/local/cuda-10.1)"
            echo "  -a, --anaconda-path  Specify the path where anaconda is installed (default: /storage/raid1/homedirs/jeremy.auguste/.anaconda3)"
            exit 0
            ;;
        *)
            echo "Error: Unknown option: $1" >&2
            echo "Usage: $0 environment_name [OPTIONS] command [arg1..argn]" >&2
            exit 1
            ;;
    esac
done

environment="$1"
shift 1

source "$anaconda_path/etc/profile.d/conda.sh"
conda activate $environment

if [[ $environment =~ ^keras.* || $environment =~ ^pytorch.* ]]; then
    export CUDA_HOME=$cuda_path
    export CUDA_ROOT=$CUDA_HOME
    export PATH=$CUDA_HOME/bin:$PATH
    export MANPATH=$CUDA_HOME/doc/man:$MANPATH
......