<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-892765308925882386</id><updated>2012-01-21T10:56:50.937-08:00</updated><title type='text'>Nutch Installation Guide</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://nutchinstall.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://nutchinstall.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>GM</name><uri>http://www.blogger.com/profile/11835516323622030621</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>4</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-892765308925882386.post-514677757183055504</id><published>2007-07-23T09:55:00.000-07:00</published><updated>2007-08-07T21:10:31.602-07:00</updated><title type='text'>Introduction</title><content type='html'>This blog is created as a guide to setting up &lt;span style="font-weight: bold; font-style: italic;"&gt;Nutch&lt;/span&gt; on Windows. 
For more details on &lt;span style="font-weight: bold; font-style: italic;"&gt;Nutch, &lt;/span&gt;please visit the following url: &lt;span style="font-style: italic;"&gt; &lt;a href="http://lucene.apache.org/nutch"&gt;http://lucene.apache.org/nutch&lt;/a&gt;&lt;/span&gt;/&lt;br /&gt;&lt;br /&gt;The following tools are required:&lt;br /&gt;&lt;br /&gt;1) &lt;span style="font-weight: bold;"&gt;&lt;span style="font-style: italic;"&gt;Nutch&lt;/span&gt;&lt;/span&gt; (using version 0.9 in this tutorial)&lt;br /&gt;&lt;a style="font-style: italic; color: rgb(51, 102, 255);" href="http://www.apache.org/dyn/closer.cgi/lucene/nutch/"&gt; http://www.apache.org/dyn/closer.cgi/lucene/nutch/ &lt;/a&gt;&lt;br /&gt;&lt;br /&gt;2) &lt;span style="font-weight: bold;"&gt;&lt;span style="font-style: italic;"&gt;Cygwin&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;&lt;a style="font-style: italic;" href="http://www.cygwin.com/"&gt;http://www.cygwin.com&lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;3) &lt;span style="font-weight: bold;"&gt;&lt;span style="font-style: italic;"&gt;Apache Tomcat &lt;/span&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;(using version 6.0 in this tutorial)&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;&lt;a style="font-style: italic;" href="http://tomcat.apache.org/download-60.cgi"&gt;http://tomcat.apache.org/download-60.cgi&lt;/a&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;4) &lt;span style="font-weight: bold;"&gt;&lt;span style="font-style: italic;"&gt;Luke &lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span style="font-style: italic;"&gt;(optional)&lt;/span&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic; color: rgb(51, 51, 255);"&gt;&lt;a 
href="http://www.getopt.org/luke/"&gt;http://www.getopt.org/luke/&lt;/a&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Most of the information is taken from the tutorial on the &lt;span style="font-style: italic;"&gt;&lt;a href="http://lucene.apache.org/nutch/tutorial8.html"&gt;&lt;span style="font-weight: bold;"&gt;Nutch &lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;a href="http://lucene.apache.org/nutch/tutorial8.html"&gt;homepage&lt;/a&gt; and the one from &lt;a href="http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html"&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-style: italic;"&gt;java.net&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;. Please visit them for more details.&lt;br /&gt;&lt;br /&gt;If you have any questions, please leave them in the comments. I will be notified by email and will see if I can help.</content><link rel='replies' type='application/atom+xml' href='http://nutchinstall.blogspot.com/feeds/514677757183055504/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=892765308925882386&amp;postID=514677757183055504' title='19 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/514677757183055504'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/514677757183055504'/><link rel='alternate' type='text/html' href='http://nutchinstall.blogspot.com/2007/07/introduction.html' title='Introduction'/><author><name>GM</name><uri>http://www.blogger.com/profile/11835516323622030621</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>19</thr:total></entry><entry><id>tag:blogger.com,1999:blog-892765308925882386.post-5014735774284717390</id><published>2007-07-23T03:04:00.001-07:00</published><updated>2007-08-09T10:11:54.044-07:00</updated><title type='text'>Setting up Cygwin and Nutch</title><content type='html'>Cygwin is used to run Nutch on Windows. Of course, you may run Nutch on Linux if desired.&lt;br /&gt;&lt;br /&gt;1) Go to the &lt;a href="http://www.cygwin.com/"&gt;Cygwin site&lt;/a&gt; to download setup.exe.&lt;br /&gt;&lt;br /&gt;2) Run setup.exe to set up Cygwin. No additional packages are required to run Nutch.&lt;br /&gt;&lt;br /&gt;3) Download the Nutch package (please choose at least version 0.8).&lt;br /&gt;&lt;br /&gt;4) Unzip the package, preferably to the Cygwin home folder for easy access.&lt;br /&gt;&lt;br /&gt;5) Test that the installation works by typing the following in the nutch folder:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_HgervwemU1s/RqSpwSPxU9I/AAAAAAAAABA/eGb5RolgnUQ/s1600-h/test.bmp"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_HgervwemU1s/RqSpwSPxU9I/AAAAAAAAABA/eGb5RolgnUQ/s400/test.bmp" alt="" id="BLOGGER_PHOTO_ID_5090380125832303570" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Verify that the following is shown:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_HgervwemU1s/RqSrmSPxU_I/AAAAAAAAABQ/KDC8R95XtAA/s1600-h/test2.JPG"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_HgervwemU1s/RqSrmSPxU_I/AAAAAAAAABQ/KDC8R95XtAA/s400/test2.JPG" alt="" id="BLOGGER_PHOTO_ID_5090382153056867314" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;6) Set CLASSPATH to the Lucene core (core version may vary):&lt;br 
/&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_HgervwemU1s/RqTFVCPxVAI/AAAAAAAAABY/Ki8sovRwY2U/s1600-h/test3.JPG"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_HgervwemU1s/RqTFVCPxVAI/AAAAAAAAABY/Ki8sovRwY2U/s400/test3.JPG" alt="" id="BLOGGER_PHOTO_ID_5090410444006446082" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;7) Set JAVA_HOME:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_HgervwemU1s/RrtKfSPxVNI/AAAAAAAAADA/52mCIYJURtE/s1600-h/tut11.JPG"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_HgervwemU1s/RrtKfSPxVNI/AAAAAAAAADA/52mCIYJURtE/s400/tut11.JPG" alt="" id="BLOGGER_PHOTO_ID_5096749304634234066" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Note: When setting CLASSPATH or JAVA_HOME, do not use paths that contain folder names with spaces in them.&lt;br /&gt;For example, naming the Nutch folder 'Nutch 0.9' instead of 'Nutch-0.9' will result in the CLASSPATH or JAVA_HOME not being recognized.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;8) Type the following to verify that the paths are set correctly: './bin/nutch crawl'&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_HgervwemU1s/RqTSGSPxVBI/AAAAAAAAABg/IfVp-1wLFPQ/s1600-h/test4.JPG"&gt;&lt;img style="cursor: pointer;" src="http://1.bp.blogspot.com/_HgervwemU1s/RqTSGSPxVBI/AAAAAAAAABg/IfVp-1wLFPQ/s400/test4.JPG" alt="" id="BLOGGER_PHOTO_ID_5090424484254536722" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The above output will appear if CLASSPATH is set correctly.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Nutch is now ready to crawl and index.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For further information on how to use Nutch, please follow the tutorials on the Nutch website and the java.net introduction to Nutch. 
The urls are given in the Introduction post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/892765308925882386-5014735774284717390?l=nutchinstall.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nutchinstall.blogspot.com/feeds/5014735774284717390/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=892765308925882386&amp;postID=5014735774284717390' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/5014735774284717390'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/5014735774284717390'/><link rel='alternate' type='text/html' href='http://nutchinstall.blogspot.com/2007/07/setting-up-cygwin-and-nutch.html' title='Setting up Cygwin and Nutch'/><author><name>GM</name><uri>http://www.blogger.com/profile/11835516323622030621</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_HgervwemU1s/RqSpwSPxU9I/AAAAAAAAABA/eGb5RolgnUQ/s72-c/test.bmp' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-892765308925882386.post-8658368717675623864</id><published>2007-01-02T04:42:00.000-08:00</published><updated>2007-08-07T18:58:33.564-07:00</updated><title type='text'>Crawling</title><content type='html'>The Nutch website already has a tutorial on crawling available &lt;a href="http://lucene.apache.org/nutch/tutorial8.html"&gt;here&lt;/a&gt; so I will just add some screenshots and further details if needed to make it easier.&lt;br /&gt;&lt;br /&gt;This tutorial will only be focusing 
on the intranet searching part for simplicity.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;Intranet Configuration&lt;br /&gt;&lt;/span&gt;(Source:  Nutch Website)&lt;br /&gt;&lt;br /&gt;&lt;p&gt;To configure things for intranet crawling you must:&lt;br /&gt;&lt;/p&gt;&lt;p&gt;1) Create a directory with a flat file of root urls.  For example, to crawl the &lt;span class="codefrag"&gt;nutch&lt;/span&gt; site you might start with a file named &lt;span class="codefrag"&gt;urls/nutch&lt;/span&gt; containing the url of just the Nutch home page.  All other Nutch pages should be reachable from this page.  The &lt;span class="codefrag"&gt;urls/nutch&lt;/span&gt; file would thus contain:&lt;/p&gt;&lt;p&gt;http://lucene.apache.org/nutch/&lt;/p&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 51, 255);"&gt;Note:  To start crawling from more than one url, you can add in more files containing the urls to be crawled. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_HgervwemU1s/Rrg2TSPxVCI/AAAAAAAAABo/umYvyiiLBMk/s1600-h/tut1.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp0.blogger.com/_HgervwemU1s/Rrg2TSPxVCI/AAAAAAAAABo/umYvyiiLBMk/s400/tut1.JPG" alt="" id="BLOGGER_PHOTO_ID_5095882683313116194" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 0, 204);"&gt;For example, in the &lt;span style="font-weight: bold;"&gt;urls&lt;/span&gt; folder, I have three flat files containing the url of SOC's homepage, NUS's homepage and Nutch's homepage respectively.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;2) Edit the file &lt;span style="font-weight: bold;" class="codefrag"&gt;conf/crawl-urlfilter.txt&lt;/span&gt; and replace &lt;span class="codefrag"&gt;MY.DOMAIN.NAME&lt;/span&gt; with the name of the domain you wish to crawl.  
For example, to limit the crawl to the &lt;span class="codefrag"&gt;apache.org&lt;/span&gt; domain, the line should read: &lt;pre class="code"&gt;+^http://([a-z0-9]*\.)*apache.org/&lt;br /&gt;&lt;/pre&gt; This will include any url in the domain &lt;span class="codefrag"&gt;apache.org&lt;/span&gt;.&lt;br /&gt;&lt;h3 class="h4"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_HgervwemU1s/Rrg4XCPxVDI/AAAAAAAAABw/tvv81rZi4LA/s1600-h/tut2.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp3.blogger.com/_HgervwemU1s/Rrg4XCPxVDI/AAAAAAAAABw/tvv81rZi4LA/s400/tut2.JPG" alt="" id="BLOGGER_PHOTO_ID_5095884946760881202" border="0" /&gt;&lt;/a&gt;&lt;/h3&gt; &lt;span style="color: rgb(102, 0, 204);"&gt;For example, I have limited the crawl to the two domains shown above.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;3) Edit the file &lt;span class="codefrag"&gt;conf/nutch-site.xml&lt;/span&gt;, insert at least the following properties into it, and fill in proper values for them:&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 0, 204);"&gt;For example, the nutch-site.xml file will include something like what is shown below. 
You will need to fill in values between the &amp;lt;value&amp;gt; and &amp;lt;/value&amp;gt; tags.&lt;br /&gt;&lt;/span&gt;&lt;pre class="code"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_HgervwemU1s/Rrkf_SPxVLI/AAAAAAAAACw/tsh2umzx8ws/s1600-h/tut9.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp1.blogger.com/_HgervwemU1s/Rrkf_SPxVLI/AAAAAAAAACw/tsh2umzx8ws/s400/tut9.JPG" alt="" id="BLOGGER_PHOTO_ID_5096139625436632242" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:130%;"&gt;&lt;span style=";font-family:times new roman;font-size:100%;"  &gt;The template for the XML file can be found in the &lt;a href="http://lucene.apache.org/nutch/tutorial8.html"&gt;Nutch Tutorial&lt;/a&gt;.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(102, 0, 204);"&gt;&lt;span style="font-family:arial;"&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:85%;"&gt;Sidenote: I am using an image because I can't seem to get blogspot to&lt;br /&gt;publish the text without removing the tags. Does anyone know how?&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;Intranet Crawling&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:times new roman;"&gt;(Source:  Nutch Website)&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;p&gt;Once things are configured, running the crawl is easy.  Just use the crawl command.  
Its options include:&lt;/p&gt; &lt;ul&gt;&lt;li&gt; &lt;span class="codefrag"&gt;-dir&lt;/span&gt; &lt;em&gt;dir&lt;/em&gt; names the directory to put the crawl in.&lt;/li&gt;&lt;li&gt; &lt;span class="codefrag"&gt;-threads&lt;/span&gt; &lt;em&gt;threads&lt;/em&gt; determines the number of threads that will fetch in parallel.&lt;/li&gt;&lt;li&gt; &lt;span class="codefrag"&gt;-depth&lt;/span&gt; &lt;em&gt;depth&lt;/em&gt; indicates the link depth from the root page that should be crawled.&lt;/li&gt;&lt;li&gt; &lt;span class="codefrag"&gt;-topN&lt;/span&gt; &lt;em&gt;N&lt;/em&gt; determines the maximum number of pages that will be retrieved at each level up to the depth.&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;For example, a typical call might be:&lt;/p&gt; &lt;pre class="code"&gt;bin/nutch crawl urls -dir crawl -depth 3 -topN 50&lt;/pre&gt; &lt;p&gt;Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (&lt;span class="codefrag"&gt;-topN&lt;/span&gt;), and watching the output to check that desired pages are fetched and undesirable pages are not.  Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10.  The number of pages per level (&lt;span class="codefrag"&gt;-topN&lt;/span&gt;) for a full crawl can be from tens of thousands to millions, depending on your resources.&lt;/p&gt; &lt;p&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;Note: When searching later, the webapp will look for the folder &lt;span style="font-weight: bold;"&gt;crawl&lt;/span&gt; relative to where Tomcat is started, unless the searcher directory is set (covered later).  
So for simplicity, you can just write the output of the crawl to &lt;span style="font-weight: bold;"&gt;crawl&lt;/span&gt;.&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://nutchinstall.blogspot.com/feeds/8658368717675623864/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=892765308925882386&amp;postID=8658368717675623864' title='17 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/8658368717675623864'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/8658368717675623864'/><link rel='alternate' type='text/html' href='http://nutchinstall.blogspot.com/2007/08/crawling.html' title='Crawling'/><author><name>GM</name><uri>http://www.blogger.com/profile/11835516323622030621</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp0.blogger.com/_HgervwemU1s/Rrg2TSPxVCI/AAAAAAAAABo/umYvyiiLBMk/s72-c/tut1.JPG' height='72' width='72'/><thr:total>17</thr:total></entry><entry><id>tag:blogger.com,1999:blog-892765308925882386.post-7196375043427441296</id><published>2007-01-01T17:40:00.000-08:00</published><updated>2007-08-07T19:12:30.673-07:00</updated><title type='text'>Searching</title><content type='html'>Searching is also pretty straightforward (in hindsight, after spending a few days trying 
to get it to work), once you realize a few things.&lt;br /&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Searching&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:100%;"&gt;(from Nutch Website)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;p style="text-align: left;"&gt;To search you need to put the nutch war file into your servlet container.  (If instead of downloading a Nutch release you checked the sources out of SVN, then you'll first need to build the war file, with the command &lt;span class="codefrag"&gt;ant war&lt;/span&gt;.)&lt;/p&gt;&lt;div style="text-align: left;"&gt; &lt;/div&gt;&lt;p style="text-align: left;"&gt;Assuming you've unpacked Tomcat as ~/local/tomcat, then the Nutch war file may be installed with the commands:&lt;/p&gt;&lt;div style="text-align: left;"&gt; &lt;pre class="code"&gt;rm -rf ~/local/tomcat/webapps/ROOT*&lt;br /&gt;cp nutch*.war ~/local/tomcat/webapps/ROOT.war&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p style="text-align: left;"&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;Ok, this is the first time that I have used Tomcat or any similar tool which explains the fumbling. Anyway, I finally realized that the above 2 lines are only necessary if you want to start the Nutch search as the default application. So, the only thing you need to do is to copy the .war file to the Tomcat's webapps folder. 
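&lt;br /&gt;&lt;br /&gt;A minimal sketch of that single copy step, assuming a POSIX shell such as the one Cygwin provides. The helper name deploy_nutch_war is hypothetical (it is not part of Nutch or Tomcat), and the paths in the example call are placeholders for your own:&lt;pre class="code"&gt;
```shell
# Hypothetical helper, not part of Nutch or Tomcat.
# Copies a Nutch .war into Tomcat's webapps folder under its original name.
#   $1 = path to the .war file
#   $2 = Tomcat install folder (for example ~/local/tomcat)
deploy_nutch_war() {
    war_file="$1"
    tomcat_dir="$2"
    # Stop early if the .war file does not exist.
    if [ ! -f "$war_file" ]; then
        echo "war file not found: $war_file"
        return 1
    fi
    # Stop early if the Tomcat folder has no webapps subfolder.
    if [ ! -d "$tomcat_dir/webapps" ]; then
        echo "no webapps folder under: $tomcat_dir"
        return 1
    fi
    # No rename to ROOT.war, so the existing ROOT application is left alone.
    cp "$war_file" "$tomcat_dir/webapps/"
}

# Example call (both paths are assumptions, adjust them to your setup):
#   deploy_nutch_war nutch-0.9.war ~/local/tomcat
```
&lt;/pre&gt;Because the file keeps its own name, Tomcat's existing ROOT application stays in place, and the search page is served under the name of the .war file rather than at the root address.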
&lt;/span&gt;&lt;/p&gt;&lt;p style="text-align: left;"&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_HgervwemU1s/RrkUPSPxVEI/AAAAAAAAAB4/5XuADQOeS1Y/s1600-h/tut3.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp1.blogger.com/_HgervwemU1s/RrkUPSPxVEI/AAAAAAAAAB4/5XuADQOeS1Y/s400/tut3.JPG" alt="" id="BLOGGER_PHOTO_ID_5096126706175005762" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p style="text-align: left;"&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;The above simply copies the .war file to the webapps folder without renaming.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p style="text-align: left;"&gt;&lt;br /&gt;The webapp finds its indexes in &lt;span class="codefrag"&gt;./crawl&lt;/span&gt;, relative to where you start Tomcat, so use a command like:&lt;/p&gt;&lt;div style="text-align: left;"&gt; &lt;pre class="code"&gt;~/local/tomcat/bin/catalina.sh start&lt;br /&gt;&lt;span style="font-family:Georgia,serif;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;If you did not place your crawl contents in &lt;span style="font-weight: bold;"&gt;crawl&lt;/span&gt; folder,&lt;br /&gt;you will need to define the search directory.&lt;br /&gt;&lt;br /&gt;1) First just start the Tomcat.&lt;/span&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_HgervwemU1s/RrkV7SPxVFI/AAAAAAAAACA/muTgziS8tmQ/s1600-h/tut4.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp1.blogger.com/_HgervwemU1s/RrkV7SPxVFI/AAAAAAAAACA/muTgziS8tmQ/s400/tut4.JPG" alt="" id="BLOGGER_PHOTO_ID_5096128561600877650" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;The .war file that you just copied to the&lt;br /&gt;/tomcat_dir/&lt;span style="font-weight: bold;"&gt;webapps &lt;/span&gt;will be automatically&lt;br /&gt;expanded, as evident from the image below. 
&lt;/span&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_HgervwemU1s/RrkWwCPxVGI/AAAAAAAAACI/kRuqddvWXyY/s1600-h/tut5.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp0.blogger.com/_HgervwemU1s/RrkWwCPxVGI/AAAAAAAAACI/kRuqddvWXyY/s400/tut5.JPG" alt="" id="BLOGGER_PHOTO_ID_5096129467838977122" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;If you have renamed your .war file to,&lt;br /&gt;say, abc.war, then it will expand&lt;br /&gt;to an &lt;span style="font-weight: bold;"&gt;abc&lt;/span&gt; folder.&lt;br /&gt;&lt;br /&gt;2) Amend the &lt;span style="font-weight: bold;"&gt;nutch-site.xml&lt;/span&gt; file in your&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-style: italic;"&gt;tomcat_dir&lt;/span&gt;/webapps/&lt;span style="font-style: italic;"&gt;expanded_dir&lt;/span&gt;/WEB-INF/classes&lt;br /&gt;&lt;/span&gt;folder. In my case, the file is&lt;br /&gt;located here:&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_HgervwemU1s/RrkYkyPxVII/AAAAAAAAACY/B9S2VMU4UQI/s1600-h/tut6.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp3.blogger.com/_HgervwemU1s/RrkYkyPxVII/AAAAAAAAACY/B9S2VMU4UQI/s400/tut6.JPG" alt="" id="BLOGGER_PHOTO_ID_5096131473588704386" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;This is the content of my &lt;span style="font-weight: bold;"&gt;nutch-site.xml&lt;/span&gt; file.&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;(Note: this nutch-site.xml file is the one&lt;br /&gt;located under tomcat_dir, not the&lt;br /&gt;one in the nutch_dir folder.)&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;a onblur="try 
{parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_HgervwemU1s/RrkkryPxVMI/AAAAAAAAAC4/jNZXzd8RD8s/s1600-h/tut10.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp3.blogger.com/_HgervwemU1s/RrkkryPxVMI/AAAAAAAAAC4/jNZXzd8RD8s/s400/tut10.JPG" alt="" id="BLOGGER_PHOTO_ID_5096144787987322050" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;So you just need to put in the path of your&lt;br /&gt;crawl directory where the indexes and segments&lt;br /&gt;are placed after the crawl.&lt;br /&gt;In this case, my folder is called &lt;span style="font-weight: bold;"&gt;crawl.test3&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;Then visit &lt;a href="http://localhost:8080/"&gt;http://localhost:8080/&lt;/a&gt;&lt;br /&gt;and have fun!&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;Note: If you are using other ports for Tomcat, please use&lt;br /&gt;the corresponding port number.&lt;br /&gt;&lt;br /&gt;For example, if port 8888 is used, the address&lt;br /&gt;will be &lt;span style="font-weight: bold;"&gt;http://localhost:8888&lt;/span&gt;.&lt;br /&gt;This will bring you to the ROOT&lt;br /&gt;application. 
If you have not changed&lt;br /&gt;anything in the original root folder, then&lt;br /&gt;you will be at the Tomcat start page, which looks&lt;br /&gt;like this:&lt;br /&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_HgervwemU1s/Rrkb0CPxVJI/AAAAAAAAACg/djaOqwE9cEU/s1600-h/tut7.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp0.blogger.com/_HgervwemU1s/Rrkb0CPxVJI/AAAAAAAAACg/djaOqwE9cEU/s400/tut7.JPG" alt="" id="BLOGGER_PHOTO_ID_5096135034116592786" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;&lt;br /&gt;However, if you have replaced the root application&lt;br /&gt;like this:&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;rm -rf ~/local/tomcat/webapps/ROOT*&lt;br /&gt;cp nutch*.war ~/local/tomcat/webapps/ROOT.war&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;then the Nutch search page will be the&lt;br /&gt;root application.&lt;br /&gt;&lt;br /&gt;In my example, the Nutch search page&lt;br /&gt;is at &lt;span style="font-weight: bold;"&gt;http://localhost:8080/nutch-0.9/&lt;br /&gt;&lt;/span&gt;&lt;span&gt;as shown in the image below.&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_HgervwemU1s/RrkccyPxVKI/AAAAAAAAACo/oRWmsX95mUE/s1600-h/tut8.JPG"&gt;&lt;img style="cursor: pointer;" src="http://bp3.blogger.com/_HgervwemU1s/RrkccyPxVKI/AAAAAAAAACo/oRWmsX95mUE/s400/tut8.JPG" alt="" id="BLOGGER_PHOTO_ID_5096135734196262050" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255);"&gt;&lt;br /&gt;&lt;br /&gt;Now you can verify that your search works&lt;br /&gt;by entering some search queries. 
If there&lt;br /&gt;are no hits when there should be, maybe&lt;br /&gt;the search directory is not set correctly.&lt;br /&gt;Or the problem may lie with the crawling part.&lt;br /&gt;&lt;br /&gt;Have fun!&lt;br /&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/892765308925882386-7196375043427441296?l=nutchinstall.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://nutchinstall.blogspot.com/feeds/7196375043427441296/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=892765308925882386&amp;postID=7196375043427441296' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/7196375043427441296'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/892765308925882386/posts/default/7196375043427441296'/><link rel='alternate' type='text/html' href='http://nutchinstall.blogspot.com/2007/01/searching.html' title='Searching'/><author><name>GM</name><uri>http://www.blogger.com/profile/11835516323622030621</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://bp1.blogger.com/_HgervwemU1s/RrkUPSPxVEI/AAAAAAAAAB4/5XuADQOeS1Y/s72-c/tut3.JPG' height='72' width='72'/><thr:total>10</thr:total></entry></feed>
