The Apache Nutch project, as you may know, is an excellent open-source web crawler. With it you can set up your own crawling system and build a search tool for your website that works much like Google. I am assuming that you have already downloaded and set up Nutch on your system; if not, you can follow this tutorial to find out how to install and configure Nutch for crawling.
A crawl job in Nutch can be started with the following command:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Notice the two parameters passed in the command: depth and topN.
In this article we will take a closer look at these parameters and see how to choose optimum values for them when crawling your own website or any other site.
The depth parameter tells Nutch how many crawl rounds to run in a single job. Each round fetches the pages currently in the crawl frontier and adds the links discovered on those pages to the frontier for the next round. So if you start a crawl job with a depth of 3, Nutch repeats the generate/fetch cycle 3 times, reaching URLs up to three links away from your seed list and fetching more documents with each round.
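To make the round-by-round behavior concrete, here is a toy sketch of how successive rounds expand the set of fetched pages. The link graph below is an invented example, not real Nutch internals; it only mimics the generate/fetch cycle described above.

```python
# Toy link graph standing in for a website (an assumption for illustration).
links = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

def crawl(depth: int) -> list:
    # Each round fetches the current frontier and queues newly
    # discovered links for the next round, like Nutch's generate/fetch cycle.
    fetched, frontier = [], ["seed"]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            fetched.append(url)
            next_frontier.extend(
                u for u in links[url]
                if u not in fetched and u not in next_frontier
            )
        frontier = next_frontier
    return fetched

print(crawl(1))  # ['seed'] -- one round fetches only the seed list
print(crawl(3))  # three rounds reach pages up to three links deep
```

With depth 1 only the seed URL is fetched; raising depth to 3 pulls in the pages two and three links away, which is exactly why a deeper crawl yields more documents.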
The topN parameter limits how many pages Nutch fetches per round (per depth). So if you estimate a website to have 3,000 pages, you might specify a depth of 3 and a topN of 1,000, aiming to crawl all 3,000 documents.
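The arithmetic above is simply depth multiplied by topN, which gives the theoretical ceiling on fetched documents. A minimal sketch (the function name is mine, not part of Nutch):

```python
def max_fetched(depth: int, top_n: int) -> int:
    # Upper bound on pages fetched: each of the `depth` rounds
    # selects at most `top_n` URLs from the crawl frontier.
    return depth * top_n

print(max_fetched(3, 1000))  # theoretical ceiling of 3000 documents
```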
However, this is only a theoretical ceiling; when you put it into practice you may notice that not even 50% of your target number of URLs gets crawled. This typically happens when URLs time out or cannot serve content to the spider at crawl time. The optimum values for depth and topN can only be found by trial and error, since both depend on the size and structure of the website being crawled.
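One practical way to compensate is to inflate topN based on the fetch success rate you observe in your crawl logs. A rough sketch, where the helper name and the 50% success rate are assumptions for illustration:

```python
import math

def adjusted_top_n(target_docs: int, depth: int, success_rate: float) -> int:
    # Inflate topN so that, at the observed success rate, the
    # expected yield still reaches the target document count.
    return math.ceil(target_docs / (depth * success_rate))

# Targeting 3000 documents over 3 rounds, with only half of
# the generated URLs fetching successfully:
print(adjusted_top_n(3000, 3, 0.5))  # topN of 2000 per round
```

You would then re-run the crawl, compare the actual document count against the target, and adjust again; that iteration is the trial and error mentioned above.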
For example, take a look at the table below:
| depth | topN | Expected documents to be crawled |
|-------|------|----------------------------------|