Monday, January 1, 2007

Searching

Searching is also pretty straightforward (in hindsight, after spending a few days trying to get it to work) once you realize a few things.


Searching
(from Nutch Website)

To search you need to put the nutch war file into your servlet container. (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command ant war.)

Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:

rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war

OK, this is the first time I have used Tomcat or any similar tool, which explains the fumbling. Anyway, I finally realized that the two lines above are only necessary if you want the Nutch search to be Tomcat's default (ROOT) application. Otherwise, the only thing you need to do is copy the .war file to Tomcat's webapps folder.


That is, simply copy the .war file to the webapps folder without renaming it.


The webapp finds its indexes in ./crawl, relative to where you start Tomcat, so use a command like:

~/local/tomcat/bin/catalina.sh start


If you did not place your crawl contents in a folder named crawl,
you will need to define the search directory.

1) First, just start Tomcat.



The .war file that you just copied to the
/tomcat_dir/webapps will be automatically
expanded, as evident from the image below.



If you have named your .war file,
say, abc.war, then it will expand
to an abc folder.
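
The naming rule above can be sketched in shell. This is only an illustration: the function name war_context_dir is made up, and it assumes Tomcat's default behaviour of expanding a .war into a folder named after the file.

```shell
# Hypothetical helper: print the folder name Tomcat is expected to
# expand a given .war file into (the file name minus ".war").
war_context_dir() {
    basename "$1" .war
}

war_context_dir ~/local/tomcat/webapps/abc.war   # prints: abc
```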

2) Amend the nutch-site.xml file in your
tomcat_dir/webapps/expanded_dir/WEB-INF/classes
folder. So, in my case, the file is
located here:




This is the content of my nutch-site.xml file.
(note: this nutch-site.xml file is the one
located in the tomcat_dir and not the
one in the nutch_dir folder.)



So you just need to put in the path of your
crawl directory where the indexes and segments
are placed after the crawl.
In this case, my folder is called crawl.test3.
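
For readers who cannot see the screenshot, here is a minimal sketch of what such a file can look like. It assumes the search directory is set through Nutch's searcher.dir property; the function name nutch_site_xml and the path /path/to/crawl.test3 are placeholders, not my actual setup.

```shell
# Hypothetical helper: print a minimal nutch-site.xml that points
# the search webapp at a crawl directory via searcher.dir.
nutch_site_xml() {
    cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$1</value>
  </property>
</configuration>
EOF
}

# Example call; the path is a placeholder for your own crawl folder.
nutch_site_xml /path/to/crawl.test3
```

Redirect the output into the nutch-site.xml under WEB-INF/classes once it looks right, and keep a copy of the original file first.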


Then visit http://localhost:8080/
and have fun!

Note: If you are using other ports for Tomcat, please use
the corresponding port number.

For example, if port 8888 is used, the address
will be http://localhost:8888.
This will bring you to the ROOT
application. If you have not changed
anything in the original root folder, then
you will be at the Tomcat start page which looks
like this:


However, if you have changed the root folder
like this:

rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war

Then the Nutch search page will be the
root application.

In my example, the Nutch search page
is at http://localhost:8080/nutch-0.9/
as shown in the image below.
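
Putting the port and the .war name together, the address of the search page can be worked out as in this rough sketch. The function name search_url is made up, and it assumes Tomcat runs on localhost and serves ROOT.war at the top-level address.

```shell
# Hypothetical helper: build the search page URL from the Tomcat
# port and the name of the .war file in webapps.
search_url() {
    port=$1
    name=$(basename "$2" .war)
    if [ "$name" = "ROOT" ]; then
        # ROOT.war is served as the default application.
        echo "http://localhost:$port/"
    else
        echo "http://localhost:$port/$name/"
    fi
}

search_url 8080 nutch-0.9.war   # prints: http://localhost:8080/nutch-0.9/
search_url 8888 ROOT.war        # prints: http://localhost:8888/
```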




Now you can verify that your search works
by entering some search queries. If there
are no hits when there should be, the
search directory may not be set correctly,
or the problem may lie in the crawling step.
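
A quick way to rule out the first cause is to check that the folder you configured really looks like a finished crawl. The sketch below is only a guess at a useful check: the function name check_crawl_dir is made up, and it assumes the crawl folder holds a segments subfolder plus an index or indexes subfolder, which may differ between Nutch versions.

```shell
# Hypothetical helper: report whether a directory looks like a
# finished crawl, i.e. has "segments" plus "index" or "indexes".
# The expected layout is an assumption, not taken from Nutch docs.
check_crawl_dir() {
    dir=$1
    if [ ! -d "$dir/segments" ]; then
        echo "missing: $dir/segments"
        return 1
    fi
    if [ ! -d "$dir/index" ] && [ ! -d "$dir/indexes" ]; then
        echo "missing: $dir/index or $dir/indexes"
        return 1
    fi
    echo "ok: $dir"
}

# Usage (placeholder path): check_crawl_dir /path/to/crawl.test3
```

If this reports a missing folder, the crawl itself is the more likely culprit than the Tomcat configuration.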

Have fun!

10 comments:

xuar said...

Sorry for trying the Nutch app so late. I am OK with all the steps except the one that moves the .war file. I did that and started Tomcat but, surprisingly, the .war file was not expanded. What shall I do now?

GM said...

Hi,

May I ask what version of Tomcat you are using?

Also, try expanding the .war file manually using WinRAR.

Thanks

Bennie Blom said...

Thanks for the great Nutch installation guide. Could you tell us something more about the use of Luke? I tried, but I could not select the index... (see also the article by Tom White: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html where they use Nutch 0.7.1)

Anonymous said...

Thanks
This is a great article. I was able to set up nutch without any issues.

Anonymous said...

Good post.

Sheetal said...

To publish text with tags, replace your tags with their character entities. Replace < with & lt; (without the space between & and lt;).
Similarly, replace > with & gt; (without the space between & and gt;).
See http://www.w3schools.com/HTML/html_entities.asp

bb said...

thanks! your tutorial works great!

Anonymous said...

Hi,
I'm having problems with spaces in the path, due to the space in the Windows 'Program Files' folder.

here is my java home path (directly copied from the Cygwin terminal).

$ echo $JAVA_HOME
C:\Program Files\Java\jdk1.6.0_24;

So when I run it,
$ ./bin/nutch crawl
./bin/nutch: line 158: C:\Program Files\Java\jdk1.6.0_24;/bin/java: No such file or directory
./bin/nutch: line 268: exec: C:\Program: not found

How can I solve this issue? Didn't you run into this?
Thanks!

samy said...

For those who are having trouble due to the space in the 'Program Files', here is the solution!

Create a new environment variable, NUTCH_JAVA_HOME.

Set its value as below (note that there is no trailing ';' and that the folder name is written the DOS way).
C:\PROGRA~1\Java\jdk1.6.0_24

To confirm with echo:
$ echo $NUTCH_JAVA_HOME
C:\PROGRA~1\Java\jdk1.6.0_24

VOILA!
