SINGA-140: Fixed bug in CollectAll() function #141

Open: wants to merge 1 commit into master

Conversation

raunaqabhyankar

In SINGA_HOME/src/worker.cc, in the "int Worker::CollectAll(int step, NeuralNet* net)" function, the unrolled layers (except the first one) should not collect parameters, because they share parameters with the first unrolled layer.

Previous:
if (layer->partition_id() == id_)
Current changes:
if (layer->partition_id() == id_ && layer->unroll_index() == 0)
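
For reference, a minimal sketch of how the new guard fits into Worker::CollectAll(). The loop structure and the helper names (GetParams(), Collect()) are illustrative assumptions, not the exact code in src/worker.cc; only the added unroll_index() == 0 check reflects the actual change:

// Sketch only: loop shape and helper names (GetParams, Collect) are assumed.
int Worker::CollectAll(int step, NeuralNet* net) {
  for (auto* layer : net->layers()) {
    // Unrolled copies of a layer share the same Param objects, so only the
    // first copy (unroll_index() == 0) needs to collect them from the servers.
    if (layer->partition_id() == id_ && layer->unroll_index() == 0) {
      for (Param* p : layer->GetParams())
        Collect(step, p);
    }
  }
  return 0;
}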

@kaiping

@nudles
Member

nudles commented Mar 30, 2016

Would you please change the commit message to follow this format "SINGA-xxx "?
Have you tried to run the char-rnn example after this commit?

@raunaqabhyankar
Author

I'll change the commit message.
I haven't run the example. Can you please tell me how to do that?
Thanks.

@nudles
Member

nudles commented Mar 30, 2016

Here are the instructions: http://singa.apache.org/docs/general-rnn.html


@raunaqabhyankar
Author

Dear sir,
Hi! Could you please tell me what the steps for execution are and what the expected output should be? I went through http://singa.apache.org/docs/general-rnn.html but did not fully understand it.
Thanks... :)

@nudles
Member

nudles commented Apr 4, 2016

Have you tried to run the example?
The instructions are similar to those for the other examples (we have provided the job.conf file in examples/char-rnn).
Please paste your output here.

@raunaqabhyankar
Author

Original Code (no changes):
$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf -test
Unique JOB_ID is 18
Record job information to /tmp/singa-log/job-info/job-18-20160408-113927
Executing : ./singa -test -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 18 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0408 11:39:27.331846 6093 cluster.cc:50] proc #0 -> localhost (pid = 6093)
E0408 11:39:27.362449 6093 worker.cc:465] accuracy = nan, Loss = nan,

$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 3
Record job information to /tmp/singa-log/job-info/job-3-20160404-225756
Executing : ./singa [-resume] -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 3 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0404 22:57:56.371260 6570 cluster.cc:50] proc #0 -> 0.0.0.0:49152 (pid = 6570)
E0404 22:57:56.398120 6592 server.cc:62] Server (group = 0, id = 0) start
E0404 22:57:56.398223 6593 worker.cc:68] Worker (group = 0, id = 0) start on GPU 0
E0404 22:57:58.417470 6593 char_rnn.cc:52] Vocab_size = 81
E0404 22:57:58.417582 6593 char_rnn.cc:72] Max iteration per epoch = 1
F0404 22:57:58.418169 6593 math_blob.h:730] Not implemented
*** Check failure stack trace: ***
@ 0x7f95f63377fd google::LogMessage::Fail()
@ 0x7f95f633947d google::LogMessage::SendToLog()
@ 0x7f95f63373e3 google::LogMessage::Flush()
@ 0x7f95f6339eae google::LogMessageFatal::~LogMessageFatal()
@ 0x7f95f6b77b30 singa::BPTTWorker::Forward()
@ 0x7f95f6b6fa0a singa::BPWorker::TrainOneBatch()
@ 0x7f95f6b79e29 singa::Worker::Run()
@ 0x7f95f55d8f30 (unknown)
@ 0x7f95f4df160a start_thread
@ 0x7f95f4b2ba4d __clone
./bin/singa-run.sh: line 109: 6570 Aborted (core dumped) $singa_run

Changed Code:
$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf -test
Unique JOB_ID is 19
Record job information to /tmp/singa-log/job-info/job-19-20160408-114146
Executing : ./singa -test -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 19 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0408 11:41:46.785352 6237 cluster.cc:50] proc #0 -> localhost (pid = 6237)
E0408 11:41:46.809041 6237 worker.cc:465] accuracy = nan, Loss = nan,

$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 4
Record job information to /tmp/singa-log/job-info/job-4-20160404-225906
Executing : ./singa [-resume] -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 4 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0404 22:59:06.511059 6839 cluster.cc:50] proc #0 -> 0.0.0.0:49152 (pid = 6839)
E0404 22:59:06.537554 6861 server.cc:62] Server (group = 0, id = 0) start
E0404 22:59:06.537652 6862 worker.cc:68] Worker (group = 0, id = 0) start on GPU 0
E0404 22:59:08.574076 6862 char_rnn.cc:52] Vocab_size = 81
E0404 22:59:08.574199 6862 char_rnn.cc:72] Max iteration per epoch = 1
F0404 22:59:08.574826 6862 math_blob.h:730] Not implemented
*** Check failure stack trace: ***
@ 0x7fade34d07fd google::LogMessage::Fail()
@ 0x7fade34d247d google::LogMessage::SendToLog()
@ 0x7fade34d03e3 google::LogMessage::Flush()
@ 0x7fade34d2eae google::LogMessageFatal::~LogMessageFatal()
@ 0x7fade3d10b30 singa::BPTTWorker::Forward()
@ 0x7fade3d08a0a singa::BPWorker::TrainOneBatch()
@ 0x7fade3d12e29 singa::Worker::Run()
@ 0x7fade2771f30 (unknown)
@ 0x7fade1f8a60a start_thread
@ 0x7fade1cc4a4d __clone
./bin/singa-run.sh: line 109: 6839 Aborted (core dumped) $singa_run

@nudles This is the output before and after the changes were made.

@nudles
Member

nudles commented Apr 16, 2016

Hi,
please compile SINGA with CUDA enabled:

./configure --enable-cuda --with-cuda=<cuda folder path>
make

If you do not have a GPU (or CUDA), comment out this line in job.conf:

#gpu: 0

@raunaqabhyankar
Author

raunaqabhyankar commented Apr 16, 2016

Hey, thanks for the tip!
Here's the output.
Original Code

[abhyankar@dhcppc4 incubator-singa]$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 4
Record job information to /tmp/singa-log/job-info/job-4-20160416-174208
Executing : ./singa -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 4 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0416 17:42:08.750080  3629 cluster.cc:50] proc #0 -> 0.0.0.0:49153 (pid = 3629)
E0416 17:42:08.776180  3651 server.cc:62] Server (group = 0, id = 0) start
E0416 17:42:08.776283  3652 worker.cc:68] Worker (group = 0, id = 0)  start on CPU
E0416 17:42:10.810894  3652 char_rnn.cc:52] Vocab_size = 81
E0416 17:42:10.811003  3652 char_rnn.cc:72] Max iteration per epoch = 1
E0416 17:42:11.357823  3652 worker.cc:465] Train @ step 0 accuracy = 0.120000, Loss = 230.064392, 
E0416 17:43:07.767719  3652 worker.cc:465] Train @ step 100 accuracy = 3.989800, Loss = 188.168106, 
E0416 17:44:03.478979  3652 worker.cc:465] Train @ step 200 accuracy = 4.135199, Loss = 183.716354, 
E0416 17:45:03.002893  3652 worker.cc:465] Train @ step 300 accuracy = 4.773601, Loss = 178.245834, 
^Z
[2]+  Stopped                 ./bin/singa-run.sh -conf examples/char-rnn/job.conf

Changed code

[abhyankar@dhcppc4 incubator-singa]$ ./bin/singa-run.sh -conf examples/char-rnn/job.conf
Unique JOB_ID is 3
Record job information to /tmp/singa-log/job-info/job-3-20160416-173813
Executing : ./singa -singa_conf /home/abhyankar/incubator-singa/conf/singa.conf -singa_job 3 -conf /home/abhyankar/incubator-singa/examples/char-rnn/job.conf
E0416 17:38:14.131456  3411 cluster.cc:50] proc #0 -> 0.0.0.0:49152 (pid = 3411)
E0416 17:38:14.147335  3433 server.cc:62] Server (group = 0, id = 0) start
E0416 17:38:14.147336  3434 worker.cc:68] Worker (group = 0, id = 0)  start on CPU
E0416 17:38:15.256013  3434 char_rnn.cc:52] Vocab_size = 81
E0416 17:38:15.265971  3434 char_rnn.cc:72] Max iteration per epoch = 1
E0416 17:38:15.834771  3434 worker.cc:465] Train @ step 0 accuracy = 0.080000, Loss = 230.700241, 
E0416 17:39:12.429210  3434 worker.cc:465] Train @ step 100 accuracy = 3.935000, Loss = 188.156631, 
E0416 17:40:08.664752  3434 worker.cc:465] Train @ step 200 accuracy = 4.251200, Loss = 183.603928, 
E0416 17:41:04.237298  3434 worker.cc:465] Train @ step 300 accuracy = 5.384400, Loss = 177.437698, 
^Z
[1]+  Stopped                 ./bin/singa-run.sh -conf examples/char-rnn/job.conf

@nudles
