Mar 20, 2015
This is work in progress!
Overview:
Nutch is an effort to build an open-source web search engine in Java, with Lucene providing the search and index component.
Solr+Nutch on Ubuntu Server 10.04 (Lucid)
This HOWTO consists of the following:
Solr indexing server: srv136.typo3.org
http://localhost:8080/solr/t3o_latest/admin/schema.jsp
Solr:
ssh -L 8080:localhost:8080 srv136.typo3.org -N
http://localhost:8080/solr/t3o_latest/admin/
Developmental and test versions of typo3.org:
Documentation URL: http://docs.typo3.org/documents.txt
Patch used by DKD: https://issues.apache.org/jira/browse/NUTCH-978
Tomcat log of Solr:
tail -f /var/log/tomcat6/catalina.out
Configuration file:
cd /usr/local/apache-nutch-for-typo3-2.1.0
nano conf/nutch-site.xml # (as described in the manual)
nano urls/seed.txt # (as described in the manual) <- http://docs.typo3.org/documents.txt
nano conf/nutch-default.xml
nano plugins/parse-html/plugin.xml
ls -l /usr/local/apache-nutch-for-typo3-2.1.0/urls
Create symlink:
ln -s /usr/local/apache-nutch-for-typo3/urls /usr/share/solr/urls
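To confirm that the link was created as intended, a small check like the following could be used. This is only a sketch: `link_points_to` is a hypothetical helper name, not part of Solr or Nutch, and the paths in the usage comment are the ones from the `ln -s` command above.

```shell
# Hypothetical helper: succeeds only if $1 is a symlink whose target is $2.
link_points_to() {
    link=$1
    expected=$2
    [ -L "$link" ] && [ "$(readlink "$link")" = "$expected" ]
}

# Usage with the paths from this HOWTO:
#   link_points_to /usr/share/solr/urls /usr/local/apache-nutch-for-typo3/urls \
#       && echo "symlink ok"
```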
Java PATH:
head -n 70 /etc/init.d/tomcat6
Run Nutch for latest (dev instance):
JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
/usr/local/apache-nutch-for-typo3/bin/nutch crawl urls \
-solr http://localhost:8080/solr/t3o_latest -dir crawl \
-depth 5 -topN 10
# or -topN 1000
Run Nutch for latest (live instance):
JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
/usr/local/apache-nutch-for-typo3-2.1.0/bin/nutch crawl urls \
-solr http://localhost:8080/solr/t3o_live -dir crawl \
-depth 5 -topN 1000
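The dev and live commands differ only in the Nutch installation path, the Solr core and the -topN value. A minimal sketch that assembles the command line for either instance is shown below. `nutch_crawl_cmd` is a hypothetical helper, not part of Nutch. It only prints the command and uses no paths or options other than those already listed in this HOWTO.

```shell
# Hypothetical helper: print the crawl command for one instance.
#   $1  instance name: "latest" (dev) or "live"
#   $2  value for -topN (optional, defaults to 1000)
nutch_crawl_cmd() {
    instance=$1
    top_n=${2:-1000}
    case "$instance" in
        latest) nutch_home=/usr/local/apache-nutch-for-typo3 ;;
        live)   nutch_home=/usr/local/apache-nutch-for-typo3-2.1.0 ;;
        *)      echo "unknown instance: $instance" >&2; return 1 ;;
    esac
    echo "JAVA_HOME=/usr/lib/jvm/java-6-openjdk $nutch_home/bin/nutch crawl urls -solr http://localhost:8080/solr/t3o_$instance -dir crawl -depth 5 -topN $top_n"
}

# Usage:
#   nutch_crawl_cmd latest 10    # short test crawl against the dev core
#   nutch_crawl_cmd live         # full crawl against the live core
```

Printing the command instead of running it makes it possible to review the line before starting a long crawl.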
Check the result: http://www.latest.dev.t3o.typo3.org/search/?id=180&L=0&q=TYPO3+Transition+Days
Example of a failed run:
JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
> /usr/local/apache-nutch-for-typo3/bin/nutch crawl urls \
> -solr http://localhost:8080/solr/t3o_latest -dir crawl \
> -depth 5 -topN 10
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=http://localhost:8080/solr/t3o_latest
topN = 10
Injector: starting at 2015-03-20 18:00:33
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/mbless/urls
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
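The trace above suggests that the relative seed path `urls` was resolved against the current working directory (here /home/mbless), where no such directory exists. One way to avoid this is to check the seed directory before starting the crawl and pass its absolute path. The following is a minimal sketch under that assumption. `check_seed_dir` is a hypothetical helper, not part of Nutch.

```shell
# Hypothetical helper: fail early if the seed directory is missing,
# otherwise print its absolute path.
check_seed_dir() {
    seed_dir=$1
    if [ ! -d "$seed_dir" ]; then
        echo "seed directory not found: $seed_dir" >&2
        return 1
    fi
    (cd "$seed_dir" && pwd)
}

# Usage (paths as used elsewhere in this HOWTO):
#   cd /usr/local/apache-nutch-for-typo3
#   seeds=$(check_seed_dir urls) && \
#   JAVA_HOME=/usr/lib/jvm/java-6-openjdk bin/nutch crawl "$seeds" \
#       -solr http://localhost:8080/solr/t3o_latest -dir crawl \
#       -depth 5 -topN 10
```

Changing into the Nutch directory first, as in the usage comment, should also make the plain relative `urls` argument work.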