Monday, July 23, 2007

Introduction

This blog is a guide to setting up Nutch on Windows. For more details on Nutch, please visit: http://lucene.apache.org/nutch/

The following tools are required:

1) Nutch (using version 0.9 in this tutorial)
http://www.apache.org/dyn/closer.cgi/lucene/nutch/

2) Cygwin
http://www.cygwin.com

3) Apache Tomcat (using version 6.0 in this tutorial)
http://tomcat.apache.org/download-60.cgi

4) Luke (optional)
http://www.getopt.org/luke/


Most of the information is taken from the tutorial on the Nutch homepage and the one on java.net. Please visit them for more details.

If you have any questions, please leave them in the comments. I will be notified by email and will see if I can help.

Setting up Cygwin and Nutch

Nutch is driven by bash shell scripts, so Cygwin is used to provide a Unix-like shell on Windows. Of course, you may run Nutch directly on Linux instead.

1) Go to the Cygwin site and download setup.exe.

2) Run setup.exe to install Cygwin. No additional packages are required to run Nutch.

3) Download the Nutch package (please choose at least version 0.8).

4) Unzip the package, preferably to the Cygwin home folder for easy access.

5) Test that the installation works by typing the following in the nutch folder:
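A minimal sketch of this check, assuming Nutch 0.9 was unpacked to a folder named nutch-0.9 in the Cygwin home folder (the folder name and the NUTCH_HOME variable are assumptions made for illustration). Run with no arguments, the bin/nutch script should print a usage summary listing its commands.

```shell
# Assumed install location; adjust to wherever you unpacked Nutch.
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch-0.9}"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
    # Run the launcher script from the Nutch folder with no arguments.
    (cd "$NUTCH_HOME" && ./bin/nutch) || true
else
    echo "bin/nutch not found under $NUTCH_HOME"
fi
```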



Verify that the following is shown:



6) Set CLASSPATH to the Lucene core JAR (the core version may vary):
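As a rough sketch, assuming the Lucene core JAR sits in the lib folder of the Nutch package and that Nutch lives in nutch-0.9 under the Cygwin home folder (both the location and the NUTCH_HOME and LUCENE_JAR names are assumptions), the JAR can be located with a wildcard so that the exact version does not need to be typed:

```shell
# Assumed install location; adjust to your own folder (no spaces in the path).
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch-0.9}"

# Pick up whichever lucene-core JAR is present, since the version may vary.
LUCENE_JAR=""
for jar in "$NUTCH_HOME"/lib/lucene-core-*.jar; do
    if [ -f "$jar" ]; then
        LUCENE_JAR="$jar"
        break
    fi
done

if [ -n "$LUCENE_JAR" ]; then
    export CLASSPATH="$LUCENE_JAR"
    echo "CLASSPATH=$CLASSPATH"
else
    echo "No lucene-core JAR found under $NUTCH_HOME/lib"
fi
```

Setting the variable by hand with a single export line works just as well; the loop only saves looking up the version number.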




7) Set JAVA_HOME



Note: When setting CLASSPATH or JAVA_HOME, do not include folders that have names with spaces in them.
For example, naming the Nutch folder 'Nutch 0.9' instead of 'Nutch-0.9' will result in the CLASSPATH or JAVA_HOME not being recognized.
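A small sketch of step 7 together with the note above. The JDK folder shown is hypothetical, so substitute the path of your own installation; check_no_spaces is a helper written for this example, not part of Nutch or Cygwin.

```shell
# Hypothetical JDK folder (Cygwin-style path, no spaces); replace with your own.
export JAVA_HOME="/cygdrive/c/Java/jdk1.6.0"

# Illustrative helper: report whether a variable's value contains a space.
check_no_spaces() {
    # $1 = variable name, $2 = its value
    case "$2" in
        *" "*) echo "Warning: $1 contains a space: $2"; return 1 ;;
        *)     echo "$1 has no spaces: $2"; return 0 ;;
    esac
}

check_no_spaces JAVA_HOME "$JAVA_HOME"
check_no_spaces CLASSPATH "${CLASSPATH:-}"
```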


8) Type the following to verify that the paths are set correctly: './bin/nutch crawl'



The above output will appear if CLASSPATH is set correctly.
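The check can also be scripted. This is only a sketch, assuming the same nutch-0.9 folder as before (an assumed location): it runs the crawl command without arguments, shows whatever is printed, and flags the class-loading errors that Java typically reports when it cannot find a class.

```shell
# Assumed install location; adjust to wherever you unpacked Nutch.
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch-0.9}"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
    # Capture both normal output and error messages from the crawl command.
    output=$(cd "$NUTCH_HOME" && ./bin/nutch crawl 2>&1) || true
    echo "$output"
    case "$output" in
        *NoClassDefFoundError*|*ClassNotFoundException*)
            echo "Java could not load a class; recheck CLASSPATH and JAVA_HOME." ;;
    esac
else
    echo "bin/nutch not found under $NUTCH_HOME"
fi
```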


Nutch is now ready to crawl and index.



For further information on how to use Nutch, please follow the tutorials on the Nutch website and the java.net introduction to Nutch. The URLs are given in the Introduction post.