This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Merge branch V0.2 to Master (#143)
* webui logpath and document (#135)

* Add webui document and logpath as a href

* fix tslint

* fix comments by Chengmin

* Pai training service bug fix and enhancement (#136)

* Add NNI installation scripts

* Update pai script, update NNI_out_dir

* Update NNI dir in nni sdk local.py

* Create .nni folder in nni sdk local.py

* Add check before creating .nni folder

* Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT

* Improve annotation (#138)

* Improve annotation

* Minor bugfix

* Selectively install through pip (#139)

Selectively install through pip 
* update setup.py

* fix paiTrainingService bugs (#137)

* fix nnictl bug

* add hdfs host validation

* fix bugs

* fix dockerfile

* fix install.sh

* update install.sh

* fix dockerfile

* Set timeout for HDFSUtility exists function

* remove unused TODO

* fix sdk

* add optional for outputDir and dataDir

* refactor dockerfile.base

* Remove unused import in hdfsclientUtility

* Add documentation for NNI PAI mode experiment (#141)

* Add documentation for NNI PAI mode

* Fix typo based on PR comments

* Exit with subprocess return code of trial keeper

* Remove additional exit code

* Fix typo based on PR comments

* update doc for smac tuner (#140)

* Revert "Selectively install through pip (#139)" due to potential pip install issue (#142)

* Revert "Selectively install through pip (#139)"

This reverts commit 1d17483.

* Add exit code of subprocess for trial_keeper

* Update README, add link to PAImode doc
yds05 committed Sep 29, 2018
1 parent 36b583b commit 2a28a57
Showing 58 changed files with 2,080 additions and 151 deletions.
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ pip Installation Prerequisites
* git, wget

```
python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.1
python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.2
source ~/.bashrc
```

Expand Down Expand Up @@ -64,6 +64,7 @@ To learn more about how this example was constructed and how to analyze the expe
* [Tuners supported by NNI.](src/sdk/pynni/nni/README.md)
* [How to enable early stop (i.e. assessor) in an experiment?](docs/EnableAssessor.md)
* [How to run an experiment on multiple machines?](docs/RemoteMachineMode.md)
* [How to run an experiment on OpenPAI?](docs/PAIMode.md)
* [How to write a customized tuner?](docs/CustomizedTuner.md)
* [How to write a customized assessor?](examples/assessors/README.md)
* [How to resume an experiment?](docs/NNICTLDOC.md)
Expand Down
4 changes: 3 additions & 1 deletion deployment/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -10,4 +10,6 @@ RUN pip3 --no-cache-dir install tensorflow-gpu==1.10.0
#
#Keras 2.1.6
#
RUN pip3 --no-cache-dir install Keras==2.1.6
RUN pip3 --no-cache-dir install Keras==2.1.6

WORKDIR /root
67 changes: 51 additions & 16 deletions deployment/Dockerfile.build.base
Original file line number Diff line number Diff line change
Expand Up @@ -22,27 +22,62 @@ FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

LABEL maintainer='Microsoft NNI Team<[email protected]>'

RUN apt-get update && apt-get install -y --no-install-recommends \
sudo apt-utils git curl vim unzip openssh-client wget \
build-essential cmake \
libopenblas-dev
ENV HADOOP_VERSION=2.7.2
LABEL HADOOP_VERSION=2.7.2

#
# Python 3.5
#
RUN apt-get install -y --no-install-recommends python3.5 python3.5-dev python3-pip python3-tk && \
pip3 install --no-cache-dir --upgrade pip setuptools && \
echo "alias python='python3'" >> /root/.bash_aliases && \
echo "alias pip='pip3'" >> /root/.bash_aliases
RUN DEBIAN_FRONTEND=noninteractive && \
apt-get -y update && \
apt-get -y install sudo \
apt-utils \
git \
curl \
vim \
unzip \
wget \
build-essential \
cmake \
libopenblas-dev \
automake \
openjdk-8-jdk \
openssh-client \
openssh-server \
lsof \
python3.5 \
python3-dev \
python3-pip \
python3-tk \
libcupti-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# numpy 1.14.3 scipy 1.1.0
RUN pip3 --no-cache-dir install \
numpy==1.14.3 scipy==1.1.0

#
#Install node 10.10.0, yarn 1.9.4, NNI v0.1
#Install hadoop
#
RUN wget -qO- http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
tar xz -C /usr/local && \
mv /usr/local/hadoop-${HADOOP_VERSION} /usr/local/hadoop

#
#Install NNI
#
RUN git clone -b v0.1 https://github.com/Microsoft/nni.git
RUN cd nni && sh install.sh
RUN echo 'PATH=~/.local/node/bin:~/.local/yarn/bin:~/.local/bin:$PATH' >> ~/.bashrc
RUN cd .. && rm -rf nni
RUN pip3 install -v --user git+https://github.com/Microsoft/[email protected]

ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
HADOOP_INSTALL=/usr/local/hadoop \
NVIDIA_VISIBLE_DEVICES=all

ENV HADOOP_PREFIX=${HADOOP_INSTALL} \
HADOOP_BIN_DIR=${HADOOP_INSTALL}/bin \
HADOOP_SBIN_DIR=${HADOOP_INSTALL}/sbin \
HADOOP_HDFS_HOME=${HADOOP_INSTALL} \
HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_INSTALL}/lib/native \
HADOOP_OPTS="-Djava.library.path=${HADOOP_INSTALL}/lib/native"

ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/root/.local/bin:/usr/bin:/sbin:/bin:${HADOOP_BIN_DIR}:${HADOOP_SBIN_DIR} \
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:${JAVA_HOME}/jre/lib/amd64/server

WORKDIR /root
4 changes: 2 additions & 2 deletions docs/GetStarted.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,12 +14,12 @@

* __Install NNI through pip__

python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.1
python3 -m pip install -v --user git+https://github.com/Microsoft/nni.git@v0.2
source ~/.bashrc

* __Install NNI through source code__

git clone -b v0.1 https://github.com/Microsoft/nni.git
git clone -b v0.2 https://github.com/Microsoft/nni.git
cd nni
chmod +x install.sh
source install.sh
Expand Down
79 changes: 79 additions & 0 deletions docs/PAIMode.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
**Run an Experiment on OpenPAI**
===
NNI supports running an experiment on [OpenPAI](https://github.com/Microsoft/pai) (aka pai); this is called pai mode. Before you start using NNI pai mode, you need an account with access to an [OpenPAI](https://github.com/Microsoft/pai) cluster. See [here](https://github.com/Microsoft/pai#how-to-deploy) if you don't have an OpenPAI account and want to deploy an OpenPAI cluster. In pai mode, your trial program runs in a Docker container created by pai.

## Setup environment
Install NNI by following the install guide [here](GetStarted.md).

## Run an experiment
Use `examples/trials/mnist-annotation` as an example. The content of the NNI config YAML file looks like:
```
authorName: your_name
experimentName: auto_mnist
# how many trials could be concurrently running
trialConcurrency: 2
# maximum experiment running duration
maxExecDuration: 3h
# empty means never stop
maxTrialNum: 100
# choice: local, remote, pai
trainingServicePlatform: pai
# choice: true, false
useAnnotation: true
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: ~/nni/examples/trials/mnist-annotation
  gpuNum: 0
  cpuNum: 1
  memoryMB: 8196
  image: openpai/pai.example.tensorflow
  dataDir: hdfs://10.1.1.1:9000/nni
  outputDir: hdfs://10.1.1.1:9000/nni
# Configuration to access OpenPAI Cluster
paiConfig:
  userName: your_pai_nni_user
  passWord: your_pai_password
  host: 10.1.1.1
```
Note: You should set `trainingServicePlatform: pai` in the NNI config YAML file if you want to start the experiment in pai mode.

Compared with LocalMode and [RemoteMachineMode](RemoteMachineMode.md), the trial configuration in pai mode has five additional keys:
* cpuNum
    * Required key. Should be a positive number based on your trial program's CPU requirement.
* memoryMB
    * Required key. Should be a positive number based on your trial program's memory requirement.
* image
    * Required key. In pai mode, your trial program will be scheduled by OpenPAI to run in a [Docker container](https://www.docker.com/). This key specifies the Docker image used to create the container in which your trial will run.
* dataDir
    * Optional key. It specifies the HDFS data directory from which the trial downloads data. The format should be something like hdfs://{your HDFS host}:9000/{your data directory}
* outputDir
    * Optional key. It specifies the HDFS output directory for the trial. Once the trial is completed (whether it succeeds or fails), the trial's stdout and stderr will be copied to this directory by the NNI SDK automatically. The format should be something like hdfs://{your HDFS host}:9000/{your output directory}
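
The key rules above can be checked before submitting an experiment. The following is only a minimal sketch, not part of NNI: `check_pai_trial_config` is a hypothetical helper, and it assumes the `trial` section has already been loaded into a Python dict. It does not enforce the port number, since the text only says the format should be "something like" `hdfs://{host}:9000/{directory}`.

```python
from urllib.parse import urlparse


def check_pai_trial_config(trial):
    """Return a list of problems found in a pai-mode `trial` section.

    Hypothetical helper for illustration; NNI performs its own validation.
    """
    problems = []

    # cpuNum and memoryMB are required and must be positive numbers.
    for key in ("cpuNum", "memoryMB"):
        value = trial.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value <= 0:
            problems.append(f"{key} must be a positive number")

    # image is required: the Docker image used to create the trial container.
    if not trial.get("image"):
        problems.append("image is required")

    # dataDir and outputDir are optional; when present they should look like
    # hdfs://{host}:{port}/{directory}.
    for key in ("dataDir", "outputDir"):
        value = trial.get(key)
        if value is None:
            continue
        parsed = urlparse(value) if isinstance(value, str) else None
        if (
            parsed is None
            or parsed.scheme != "hdfs"
            or not parsed.hostname
            or not parsed.path.strip("/")
        ):
            problems.append(f"{key} should look like hdfs://host:9000/directory")

    return problems


# Values copied from the example config above.
example_trial = {
    "command": "python3 mnist.py",
    "codeDir": "~/nni/examples/trials/mnist-annotation",
    "gpuNum": 0,
    "cpuNum": 1,
    "memoryMB": 8196,
    "image": "openpai/pai.example.tensorflow",
    "dataDir": "hdfs://10.1.1.1:9000/nni",
    "outputDir": "hdfs://10.1.1.1:9000/nni",
}
print(check_pai_trial_config(example_trial))
```

An empty list means none of the checks in this sketch failed; it does not guarantee that the OpenPAI cluster will accept the job.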

Once you have filled in the NNI experiment config file and saved it (for example, as exp_pai.yaml), run the following command
```
nnictl create --config exp_pai.yaml
```
to start the experiment in pai mode. NNI will create an OpenPAI job for each trial, and the job name format is something like `nni_exp_{experiment_id}_trial_{trial_id}`.
You can see the pai jobs created by NNI in your OpenPAI cluster's web portal, like:
![](./nni_pai_joblist.jpg)

Notice: In pai mode, NNIManager starts a REST server that listens on port `51189` to receive metrics from trial jobs running in PAI containers. You should therefore open TCP port `51189` in your firewall rules to allow incoming traffic.
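
One way to confirm that the firewall rule works is to attempt a TCP connection to the port from another machine. The sketch below is an assumption-laden illustration rather than an NNI tool: `port_is_reachable` is a hypothetical helper, and the host you pass in must be the address of the machine running NNIManager.

```python
import socket


def port_is_reachable(host, port=51189, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper; 51189 is the NNIManager REST port named above.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (replace the placeholder with the NNIManager machine's address):
# print(port_is_reachable("nni-manager-host.example"))
```

A `False` result only shows that the connection attempt failed; the cause may be the firewall, a stopped experiment, or a wrong address.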

Once a trial job is completed, you can go to the NNI WebUI's overview page (like http://localhost:8080/oview) to check the trial's information.

Expand a trial's information in the trial list view and click the logPath link, like:
![](./nni_webui_joblist.jpg)

You will then be redirected to the HDFS web portal to browse the output files of that trial in HDFS:
![](./nni_trial_hdfs_output.jpg)

You can see there are three files in the output folder: stderr, stdout, and trial.log.

If you also want to save a trial's other output to HDFS, such as model files, you can use the environment variable `NNI_OUTPUT_DIR` in your trial code to save your own output files, and the NNI SDK will copy all the files in `NNI_OUTPUT_DIR` from the trial's container to HDFS.
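
In trial code, this could look like the minimal sketch below. The helper name `save_trial_output` and the file name in the usage line are hypothetical, and the fallback to the current directory when `NNI_OUTPUT_DIR` is unset is an assumption added so the sketch also runs outside NNI.

```python
import os


def save_trial_output(filename, data):
    """Write `data` (bytes) into NNI_OUTPUT_DIR and return the full path.

    Hypothetical helper for illustration only.
    """
    # NNI_OUTPUT_DIR comes from the trial's environment. Falling back to the
    # current directory is an assumption, not documented NNI behavior.
    output_dir = os.environ.get("NNI_OUTPUT_DIR", os.getcwd())
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "wb") as f:
        f.write(data)
    return path


# Usage inside a trial, after training has produced some serialized bytes:
# save_trial_output("model.bin", serialized_model_bytes)
```

Anything written this way sits alongside the other files in `NNI_OUTPUT_DIR`, so it is picked up by the copy step described above.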

If you have any problems when using NNI in pai mode, please create issues on the [NNI github repo](https://github.com/Microsoft/nni), or send mail to [email protected]

54 changes: 54 additions & 0 deletions docs/WebUI.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# WebUI

## View summary page

Click the tab "Overview".

* See the experiment parameters.
* See search_space json.
* See the trials with good performance.

![](./img/overview.jpg)

## View job accuracy

Click the tab "Optimization Progress" to see the point graph of all trials. Hover over a point to see its specific accuracy.

![](./img/accuracy.jpg)

## View hyper parameter

Click the tab "Hyper Parameter" to see the parallel graph.

* You can select the percentage to see top trials.
* Choose two axes to swap their positions.

![](./img/searchspace.jpg)

## View trial status

Click the tab "Trial Status" to see the status of all trials. Specifically:

* Trial duration: trial's duration in the bar graph.
* Trial detail: trial's id, trial's duration, start time, end time, status, accuracy and search space file.

![](./img/openRow.jpg)

* Kill: you can kill a job whose status is running.
* Tensor: you can see a job in the TensorFlow graph; it links to the TensorBoard page.

![](./img/trialStatus.jpg)

* Intermediate Result Graph.

![](./img/intermediate.jpg)

## Control

Click the tab "Control" to add a new trial or update the search_space file and some experiment parameters.

![](./img/control.jpg)

## Feedback

[Known Issues](https://github.com/Microsoft/nni/issues).
Binary file added docs/img/accuracy.jpg
Binary file added docs/img/control.jpg
Binary file added docs/img/intermediate.jpg
Binary file added docs/img/openRow.jpg
Binary file added docs/img/overview.jpg
Binary file added docs/img/searchspace.jpg
Binary file added docs/img/trialStatus.jpg
Binary file added docs/nni_pai_joblist.jpg
Binary file added docs/nni_trial_hdfs_output.jpg
Binary file added docs/nni_webui_joblist.jpg
5 changes: 3 additions & 2 deletions examples/trials/auto-gbdt/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,13 @@ searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
#choice: TPE, Random, Anneal, Evolution,
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: minimize
trial:
command: python3 main.py
codeDir: .
gpuNum: 0
gpuNum: 0
5 changes: 3 additions & 2 deletions examples/trials/mnist-annotation/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,12 +8,13 @@ trainingServicePlatform: local
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution
#choice: TPE, Random, Anneal, Evolution,
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
gpuNum: 0
1 change: 1 addition & 0 deletions examples/trials/mnist-batch-tune-keras/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ searchSpacePath: search_space.json
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution, BatchTuner
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: BatchTuner
classArgs:
#choice: maximize, minimize
Expand Down
5 changes: 3 additions & 2 deletions examples/trials/mnist-keras/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,13 @@ searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
#choice: TPE, Random, Anneal, Evolution,
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist-keras.py
codeDir: .
gpuNum: 0
gpuNum: 0
5 changes: 3 additions & 2 deletions examples/trials/mnist-smartparam/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,12 +8,13 @@ trainingServicePlatform: local
#choice: true, false
useAnnotation: true
tuner:
#choice: TPE, Random, Anneal, Evolution
#choice: TPE, Random, Anneal, Evolution,
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
gpuNum: 0
5 changes: 3 additions & 2 deletions examples/trials/mnist/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,13 @@ searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
#choice: TPE, Random, Anneal, Evolution,
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
optimize_mode: maximize
trial:
command: python3 mnist.py
codeDir: .
gpuNum: 0
gpuNum: 0
5 changes: 3 additions & 2 deletions examples/trials/mnist/config_assessor.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,8 @@ searchSpacePath: ~/nni/examples/trials/mnist/search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
#choice: TPE, Random, Anneal, Evolution,
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
Expand All @@ -23,4 +24,4 @@ assessor:
trial:
command: python3 mnist.py
codeDir: ~/nni/examples/trials/mnist
gpuNum: 0
gpuNum: 0
3 changes: 2 additions & 1 deletion examples/trials/pytorch_cifar10/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,8 @@ searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
#choice: TPE, Random, Anneal, Evolution
#choice: TPE, Random, Anneal, Evolution,
#SMAC (SMAC should be installed through nnictl)
builtinTunerName: TPE
classArgs:
#choice: maximize, minimize
Expand Down
4 changes: 3 additions & 1 deletion install.sh
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
#!/bin/bash
make build
make install-dependencies
make build
make dev-install
make install-examples
make update-bash-config
source ~/.bashrc
5 changes: 4 additions & 1 deletion src/nni_manager/training_service/pai/hdfsClientUtility.ts
Original file line number Diff line number Diff line change
Expand Up @@ -131,7 +131,10 @@ export namespace HDFSClientUtility {
const deferred : Deferred<boolean> = new Deferred<boolean>();
hdfsClient.exists(hdfsPath, (exist : boolean ) => {
deferred.resolve(exist);
})
});

// Set timeout and reject the promise once reach timeout (5 seconds)
setTimeout(() => deferred.reject(`Check HDFS path ${hdfsPath} exists timeout`), 5000);

return deferred.promise;
}
Expand Down