
Python for CaffeOnSpark

Before you begin, make sure you have exported the environment variables required for CaffeOnSpark, as described in the GetStarted guide for your platform: GetStarted_local_osx, GetStarted_local, or GetStarted_yarn.

Setup your Python environment

You can either install Python as described in the "Fresh python install" section below or use an existing installation on your machine. The following Python modules are required in your Python installation (a verification snippet follows the list):

Cython>=0.19.2
numpy>=1.7.1
scipy>=0.13.2
scikit-image>=0.9.3
matplotlib>=1.3.1
ipython>=3.0.0
h5py>=2.2.0
leveldb>=0.191
networkx>=1.8.1
nose>=1.3.0
pandas>=0.12.0
python-dateutil>=1.4,<2
protobuf>=2.5.0
python-gflags>=2.0
pyyaml>=3.10
Pillow>=2.3.0
six>=1.1.0
ipython[notebook]
py4j
pydot2
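
Once the modules are installed, you can sanity-check them from Python. Below is a minimal sketch; note that several modules import under a different name than their pip package (e.g. scikit-image imports as skimage, Pillow as PIL):

# Verify that the key dependencies are importable and print their versions
import importlib

for name in ['Cython', 'numpy', 'scipy', 'skimage', 'matplotlib',
             'IPython', 'h5py', 'leveldb', 'networkx', 'nose', 'pandas',
             'dateutil', 'google.protobuf', 'gflags', 'yaml', 'PIL',
             'six', 'py4j', 'pydot']:
    try:
        mod = importlib.import_module(name)
        print('%-16s %s' % (name, getattr(mod, '__version__', 'ok')))
    except ImportError as e:
        print('%-16s MISSING (%s)' % (name, e))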

Graphviz (though not a Python package) is another dependency that must be installed on your system. Once you have installed these Python modules, point IPYTHON_ROOT at your Python installation and export the variables below:

export IPYTHON_ROOT=<your python installation>
export PYSPARK_PYTHON=${IPYTHON_ROOT}/bin/python
export PATH=${SPARK_HOME}/bin:${IPYTHON_ROOT}/bin:${PATH}
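
To confirm that Spark will pick up the intended interpreter, you can run a quick check with ${PYSPARK_PYTHON}; this is just a sanity-check sketch:

# Print the interpreter in use and the relevant environment variables
import os, sys

print('python executable = %s' % sys.executable)
for var in ('IPYTHON_ROOT', 'PYSPARK_PYTHON', 'SPARK_HOME'):
    print('%s = %s' % (var, os.environ.get(var, '<not set>')))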
Fresh python install

Skip this step if you pointed IPYTHON_ROOT to your own installation above.

export IPYTHON_ROOT=~/Python2.7.10 #Change this directory to install elsewhere.
curl -O https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
tar -xvf Python-2.7.10.tgz
rm Python-2.7.10.tgz
pushd Python-2.7.10 >/dev/null
./configure --prefix="${IPYTHON_ROOT}"
make
make install
popd >/dev/null
rm -rf Python-2.7.10
pushd "${IPYTHON_ROOT}" >/dev/null
curl -O https://bootstrap.pypa.io/get-pip.py
bin/python get-pip.py
rm get-pip.py
bin/pip install <all dependencies listed in the first section>  # e.g. save the list above to requirements.txt and run: bin/pip install -r requirements.txt
popd >/dev/null
Submit Python Script
pushd ${CAFFE_ON_SPARK}/data/
unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
spark-submit  --master ${MASTER_URL} \
	      --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
	      --driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
	      --conf spark.cores.max=${TOTAL_CORES} \
	      --conf spark.task.cpus=${CORES_PER_WORKER} \
	      --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
	      --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
	      --py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
	      --files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so,${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
	      --jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
	      --conf spark.pythonargs="-conf lenet_memory_solver.prototxt -model file:///tmp/lenet.model -features accuracy,ip1,ip2 -label label -output file:///tmp/output -devices 1 -outputFormat json -clusterSize ${SPARK_WORKER_INSTANCES}" \
	      examples/MultiClassLogisticRegression.py
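
The -output option above writes the extracted features to file:///tmp/output in JSON (per -outputFormat json). A minimal sketch for inspecting the result from any pyspark session (assumes sqlContext is available, as in the interactive shell below):

# Inspect the JSON feature output written by the job above
df = sqlContext.read.json('file:///tmp/output')
df.printSchema()   # expect the requested columns: accuracy, ip1, ip2, label
df.show(10)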
Launch Python Interactive Shell

In this example, we use CaffeOnSpark to train a model and extract features, and then run LogisticRegression against the extracted features.

pushd ${CAFFE_ON_SPARK}/data/
unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
IPYTHON=1 pyspark  --master ${MASTER_URL} \
	  	   --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
		   --driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
		   --conf spark.cores.max=${TOTAL_CORES} \
		   --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
		   --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
		   --py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
		   --files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so \
		   --jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" 

Run examples
from pyspark import SparkConf,SparkContext
from com.yahoo.ml.caffe.RegisterContext import registerContext,registerSQLContext
from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
from com.yahoo.ml.caffe.Config import Config
from com.yahoo.ml.caffe.DataSource import DataSource
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
registerContext(sc)
registerSQLContext(sqlContext)
cos=CaffeOnSpark(sc,sqlContext)
cfg=Config(sc)
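# Point protoFile at your own copy of the solver prototxt (the path below is the author's)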
cfg.protoFile='/Users/afeng/dev/ml/CaffeOnSpark/data/lenet_memory_solver.prototxt'
cfg.modelPath = 'file:/tmp/lenet.model'
cfg.devices = 1
cfg.isFeature=True
cfg.label='label'
cfg.features=['ip1']
cfg.outputFormat = 'json'
cfg.clusterSize = 1
cfg.lmdb_partitions=cfg.clusterSize
#Train
dl_train_source = DataSource(sc).getSource(cfg,True)
cos.train(dl_train_source)
#Extract features
lr_raw_source = DataSource(sc).getSource(cfg,False)
extracted_df = cos.features(lr_raw_source)
extracted_df.show(10)
# Do multiclass LogisticRegression
data = extracted_df.map(lambda row: LabeledPoint(row.label[0], Vectors.dense(row.ip1)))
lr = LogisticRegressionWithLBFGS.train(data, numClasses=10, iterations=10)
predictions = lr.predict(data.map(lambda pt : pt.features))
predictions.take(10)
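
To get a rough accuracy number for the logistic regression above, compare predictions with the labels; a short sketch reusing data and lr from the example:

# Compare predicted vs. actual labels on the training data
labels_and_preds = data.map(lambda pt: (pt.label, float(lr.predict(pt.features))))
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(data.count())
print('training accuracy = %g' % accuracy)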
IPythonNotebook
Generate data for demo notebook

This step is required only if you want to run the sample notebook shipped with the code at ${CAFFE_ON_SPARK}/data/examples/DLDemo.ipynb. Skip to the next section if you don't want to run the demo notebook.

rm -rf ${CAFFE_ON_SPARK}/data/mnist_train_dataframe
spark-submit --master local[1] \
             --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
             --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
             --class com.yahoo.ml.caffe.tools.LMDB2DataFrame \
             ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
             -imageRoot file:${CAFFE_ON_SPARK}/data/mnist_train_lmdb \
             -lmdb_partitions ${TOTAL_CORES} \
             -outputFormat parquet \
             -output file:${CAFFE_ON_SPARK}/data/mnist_train_dataframe


rm -rf ${CAFFE_ON_SPARK}/data/mnist_test_dataframe
spark-submit --master local[1] \
             --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
             --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
             --class com.yahoo.ml.caffe.tools.LMDB2DataFrame \
             ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
             -imageRoot file:${CAFFE_ON_SPARK}/data/mnist_test_lmdb \
             -lmdb_partitions ${TOTAL_CORES} \
             -outputFormat parquet \
             -output file:${CAFFE_ON_SPARK}/data/mnist_test_dataframe
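
You can verify the generated dataframes before wiring them into the prototxt; a minimal sketch from a pyspark session (assumes sqlContext and that CAFFE_ON_SPARK is set in the environment):

# Inspect the converted MNIST training dataframe
import os
df = sqlContext.read.parquet('file:%s/data/mnist_train_dataframe' % os.environ['CAFFE_ON_SPARK'])
df.printSchema()
df.show(5)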

Make sure that your ${CAFFE_ON_SPARK}/data/lenet_dataframe_train_test.prototxt is updated to point to ${CAFFE_ON_SPARK}/data/mnist_train_dataframe for training and ${CAFFE_ON_SPARK}/data/mnist_test_dataframe for test.
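
If you prefer to script that edit, here is a rough sketch; it assumes the train/test data layers in lenet_dataframe_train_test.prototxt carry source: "..." lines naming mnist_train_*/mnist_test_* paths, which may not match your copy exactly:

# Repoint the train/test data sources at the dataframe directories
import os, re

root = os.environ['CAFFE_ON_SPARK']
proto = os.path.join(root, 'data', 'lenet_dataframe_train_test.prototxt')

text = open(proto).read()
text = re.sub(r'source:\s*"[^"]*mnist_train[^"]*"',
              'source: "file:%s/data/mnist_train_dataframe"' % root, text)
text = re.sub(r'source:\s*"[^"]*mnist_test[^"]*"',
              'source: "file:%s/data/mnist_test_dataframe"' % root, text)
open(proto, 'w').write(text)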

Launch IPythonNotebook
export IPYTHON_OPTS="notebook --no-browser --ip=`hostname`"
pushd ${CAFFE_ON_SPARK}/data/
unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
IPYTHON=1 pyspark  --master ${MASTER_URL} \
	  	   --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
		   --driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
		   --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
		   --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
		   --conf spark.executorEnv.DYLD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
		   --py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
		   --conf spark.cores.max=${TOTAL_CORES} \
		   --conf spark.task.cpus=${CORES_PER_WORKER} \
		   --files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so,${CAFFE_ON_SPARK}/data/lenet_dataframe_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_dataframe_train_test.prototxt \
		   --jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" 

When you run the above, the console will print a URL, which you should open in your browser. There, click examples/DLDemo.ipynb. When executing the notebook, replace the paths of files such as lenet_memory_solver.prototxt and mnist_test_dataframe with the full paths of those files on your system.