Monday, July 23, 2007

Setting up Cygwin and Nutch

Cygwin is used to run Nutch on Windows. Of course, you may run Nutch on Linux if desired.

1) Go to the Cygwin site and download setup.exe.

2) Run setup.exe to set up Cygwin. No additional package is required to run Nutch.

3) Download the Nutch package (please choose at least version 0.8).

4) Unzip the package, preferably to the Cygwin home folder for easy access.

5) Test that the installation works by typing the following in the nutch folder:



Verify that the following is shown:



6) Set CLASSPATH to the Lucene core JAR (the core version may vary):




7) Set JAVA_HOME



Note: When setting CLASSPATH or JAVA_HOME, do not include folders that have names with spaces in them.
For example, naming the Nutch folder 'Nutch 0.9' instead of 'Nutch-0.9' will result in the CLASSPATH or JAVA_HOME not being recognized.
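
As a rough illustration of the note above, here is a minimal sketch of how the two variables might be set and checked for spaces in a Cygwin bash shell. The helper name check_no_spaces and every path shown are assumptions for illustration only (they are not taken from Nutch or Cygwin), so substitute your own install locations and Lucene core version.

```shell
#!/bin/bash
# Hypothetical helper (not part of Nutch or Cygwin): reports whether a
# value contains a space and returns non-zero if it does.
check_no_spaces() {
  local name="$1" value="$2"
  case "$value" in
    *" "*) echo "$name contains a space: $value"; return 1 ;;
    *)     echo "$name looks OK: $value"; return 0 ;;
  esac
}

# Example values only (assumed paths and versions); use your own.
export JAVA_HOME="/cygdrive/c/Java/jdk1.6.0_24"
export CLASSPATH="/home/user/nutch-0.9/lib/lucene-core-2.1.0.jar"

check_no_spaces JAVA_HOME "$JAVA_HOME"
check_no_spaces CLASSPATH "$CLASSPATH"

# A folder named 'Nutch 0.9' would be flagged:
check_no_spaces NUTCH_DIR "/home/user/Nutch 0.9" || true
```

If either of the first two checks reports a space, rename the folder or move the installation before continuing with the next step.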


8) Type the following to verify that the paths are set correctly: './bin/nutch crawl'



The above output will appear if CLASSPATH is set correctly.


Nutch is now ready to crawl and index.



For further information on how to use Nutch, please follow the tutorials on the Nutch website and the java.net introduction to Nutch. The URLs are given in the Introduction post.

9 comments:

Anonymous said...

If you encounter this problem:
$ ./bin/nutch
./bin/nutch: line 15: syntax error near un
'/bin/nutch: line 15: `case "`uname`" in

run d2u:
$ d2u bin/nutch
bin/nutch: done.

samy said...

For those who are having trouble due to the space in 'Program Files' and are getting an error like the one below:

$ ./bin/nutch crawl
./bin/nutch: line 158: C:\Program Files\Java\jdk1.6.0_24;/bin/java: No such file or directory
./bin/nutch: line 268: exec: C:\Program: not found

Here is the solution!

Create a new environment variable $NUTCH_JAVA_HOME.

Set its value as below (note: no trailing ';', and the folder name written the DOS short-name way).
C:\PROGRA~1\Java\jdk1.6.0_24

It should echo exactly like this:
$ echo $NUTCH_JAVA_HOME
C:\PROGRA~1\Java\jdk1.6.0_24

Anonymous said...

@samy

Thanx a lot man, you saved my day.

stetsa said...

Hi. I've read your post, but I got this. Please, I need your help:


$ ./nutch crawl
cygpath: can't convert empty path
Error occurred during initialization of VM
java/lang/ClassNotFoundException: error in opening JAR file C:\PROGRA~1\Java\jdk1.7.0_03\jre\lib\rt.jar

mugeesh said...

cygpath: can't convert empty path
I have also got the same error as stetsa.
So please help me overcome this problem, or send it to my email, mugeesh@gmail.com

Unknown said...

cygpath: can't convert empty path
./nutch: line 268: exec: C:\Program: not found
what a damn i have been trying for a long time any one pls help me

kirakblog said...

I am getting the same error too: c:\Program Files\ Not found. Can anyone please explain?

Unknown said...

I am new to Nutch and have a problem with my initial deployment under Cygwin:
$ cd apache-nutch-1.4-bin/runtime/
-bash: cd: apache-nutch-1.4-bin/runtime /: No such file or directory,
I would like to know why and how to fix it. Can you please help me?

none said...

- Change your JDK path in the environment variables so that it has no spaces.
- In Windows 7, go to Control Panel > System > Advanced system settings, open the Advanced tab, and click Environment Variables.
- Change JAVA_HOME in both the user variables and the system variables.
- To find the name of the path without spaces, open a command prompt and run "dir /x"; this shows the short names Windows uses for paths that contain spaces. They usually consist of the first 6 characters followed by ~1. On my 64-bit Windows 7 I installed the Java 8 JDK in Program Files (x86), so my path was c:\progra~2\java\jdk1.8.0.0_25