Solr for docs.typo3.org

This is work in progress!

Overview:

About Solr
About Nutch
- Nutch Installation
Resources
Solr Indexing Server
- Running Cores
Incoming
- Schema Browser
Nutch
- To do
A problem
t3org in a vagrant box

About Solr

About Nutch

Apache Nutch is an open source web search engine project written in Java. It uses Lucene for its search and index components.

http://nutch.apache.org/ Official website
http://wiki.apache.org/nutch/ Official wiki and tutorials
As of 2015-03-20, the current versions are Apache Nutch 2.3 (src-tar and src-zip only) and 1.9 (src-tar, src-zip, bin-tar and bin-zip)
/usr/local/apache-nutch-for-typo3-2.1.0

Nutch Installation

Nutch Tutorial on Ubuntu (10 easy steps)
Solr+Nutch on Ubuntu Server 10.04 (Lucid)

This HOWTO consists of the following:
1. Installing Solr
2. Installing Nutch
3. Configuring Solr
4. Configuring Nutch
5. Crawling your site
6. Indexing the crawl DB with Solr
7. Searching the crawled content in Solr

Resources

Solr Indexing Server

Solr indexing server: srv136.typo3.org

Running Cores

http://localhost:8080/solr/t3o_live/admin/ Do not touch!
http://localhost:8080/solr/t3o_deploy/admin/ Do not touch!
http://localhost:8080/solr/t3o_latest/admin/ Use this one! For development - you may empty the index.
http://localhost:8080/solr/t3o_testing/admin/ Oops, what’s this?
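
Because two of the cores listed above must not be touched, a maintenance script could check the core name before doing anything. The following is only a minimal sketch: the function name is_dev_core is hypothetical, and the rule that only t3o_latest is allowed is an assumption derived from the remarks above.

```shell
#!/bin/sh
# Hypothetical guard: succeed only for the core these notes mark as safe.
is_dev_core() {
    case "$1" in
        t3o_latest) return 0 ;;   # "Use this one!"
        *)          return 1 ;;   # t3o_live, t3o_deploy and unknown cores
    esac
}

# Example: report the verdict for each core mentioned in these notes.
for core in t3o_live t3o_deploy t3o_latest t3o_testing; do
    if is_dev_core "$core"; then
        echo "$core: ok for development"
    else
        echo "$core: do not touch"
    fi
done
```

Treating unknown cores as off limits is the cautious default here, which also covers t3o_testing until its purpose is clarified.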

Incoming

Schema Browser

http://localhost:8080/solr/t3o_latest/admin/schema.jsp

content
URL
type

Solr:

ssh -L 8080:localhost:8080 srv136.typo3.org -N

http://localhost:8080/solr/t3o_latest/admin/
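
With the tunnel open, every core on the indexing server is reachable under the same local base URL. A small sketch of how this could be scripted follows; the function names open_tunnel and solr_admin_url are hypothetical, and the host, port and URL layout are simply taken from the lines above.

```shell
#!/bin/sh
# Local end of the tunnel, as used throughout these notes.
SOLR_BASE="http://localhost:8080/solr"

# Open the tunnel in the background (-f) without a remote command (-N).
# Defined only; call it once per session when needed.
open_tunnel() {
    ssh -f -N -L 8080:localhost:8080 srv136.typo3.org
}

# Hypothetical helper: print the admin URL of the given core.
solr_admin_url() {
    printf '%s/%s/admin/\n' "$SOLR_BASE" "$1"
}

solr_admin_url t3o_latest
```

Keeping the base URL in one variable means a different local port only has to be changed in a single place.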

Development and test versions of typo3.org:

https://www.latest.dev.t3o.typo3.org/ (srv112)
https://www.latest.dev.t3o.typo3.org/typo3
In the backend you can administer the Solr index: clear index, commit index

Documentation URL: http://docs.typo3.org/documents.txt

Nutch

Patch used by DKD: https://issues.apache.org/jira/browse/NUTCH-978

Tomcat log of Solr:

tail -f /var/log/tomcat6/catalina.out

Configuration files (paths are relative to the Nutch directory):

cd /usr/local/apache-nutch-for-typo3-2.1.0
nano conf/nutch-site.xml # (as described in the manual)
nano urls/seed.txt # (as described in the manual); contains http://docs.typo3.org/documents.txt
nano conf/nutch-default.xml
nano plugins/parse-html/plugin.xml

ls -l /usr/local/apache-nutch-for-typo3-2.1.0/urls
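
The seed list is a plain text file with one URL per line. As a hedged illustration, the sketch below creates such a file with the single URL named in these notes. The helper name write_seed is hypothetical, and the demonstration writes into a scratch directory rather than the real installation.

```shell
#!/bin/sh
# Hypothetical helper: create <dir>/urls/seed.txt with the seed URL from these notes.
write_seed() {
    nutch_home="$1"
    mkdir -p "$nutch_home/urls" || return 1
    # One URL per line; here only the documents list of docs.typo3.org.
    printf '%s\n' 'http://docs.typo3.org/documents.txt' > "$nutch_home/urls/seed.txt"
}

# Demonstration against a scratch directory instead of /usr/local/...
scratch="$(mktemp -d)"
write_seed "$scratch"
cat "$scratch/urls/seed.txt"
```

To apply it for real, the scratch directory would be replaced by the Nutch directory, which normally requires write permission there.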

Create symlink:

ln -s /usr/local/apache-nutch-for-typo3/urls /usr/share/solr/urls

Java PATH:

head -n 70 /etc/init.d/tomcat6

Run Nutch for latest (dev instance):

JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
   /usr/local/apache-nutch-for-typo3/bin/nutch crawl urls \
   -solr http://localhost:8080/solr/t3o_latest -dir crawl \
   -depth 5 -topN 10
# or -topN 1000
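
In the Nutch 1.x crawl command, -depth is the link depth followed from the seed URLs and -topN caps the number of pages fetched at each level, so -topN 10 gives a quick test run and -topN 1000 a fuller crawl. Since the invocations in these notes differ mainly in the core name and the topN value, they could be generated by one helper. This is only a sketch: the function name crawl_cmd is hypothetical, it merely prints the command line, and all paths are copied from the command above.

```shell
#!/bin/sh
# Values taken from the commands in these notes.
JAVA_DIR="/usr/lib/jvm/java-6-openjdk"
NUTCH_BIN="/usr/local/apache-nutch-for-typo3/bin/nutch"
SOLR_BASE="http://localhost:8080/solr"

# Hypothetical helper: print the crawl command for a core and a topN value.
# It only prints the command; nothing is executed.
crawl_cmd() {
    core="$1"
    top_n="${2:-10}"
    printf 'JAVA_HOME=%s %s crawl urls -solr %s/%s -dir crawl -depth 5 -topN %s\n' \
        "$JAVA_DIR" "$NUTCH_BIN" "$SOLR_BASE" "$core" "$top_n"
}

crawl_cmd t3o_latest        # quick test run, topN defaults to 10
crawl_cmd t3o_latest 1000   # fuller crawl
```

Printing instead of executing makes it easy to review the command before running it against a core.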

Run Nutch for live (live instance):

JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
   /usr/local/apache-nutch-for-typo3-2.1.0/bin/nutch crawl urls \
   -solr http://localhost:8080/solr/t3o_live -dir crawl \
   -depth 5 -topN 1000

Check the result: http://www.latest.dev.t3o.typo3.org/search/?id=180&L=0&q=TYPO3+Transition+Days

To do

We should try the latest version (2.1.1) of dkd/nutch-typo3-cms, as suggested by Olivier Dobberkau in the comments.
We should run Nutch with a more recent version of OpenJDK or the Sun JDK.

A problem

JAVA_HOME=/usr/lib/jvm/java-6-openjdk \
>     /usr/local/apache-nutch-for-typo3/bin/nutch crawl urls \
>     -solr http://localhost:8080/solr/t3o_latest -dir crawl \
>     -depth 5 -topN 10
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=http://localhost:8080/solr/t3o_latest
topN = 10
Injector: starting at 2015-03-20 18:00:33
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/mbless/urls
     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
     at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:416)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
     at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
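
The trace shows that the relative seed directory (rootUrlDir = urls) was resolved to file:/home/mbless/urls, that is, against the current working directory. Presumably the crawl was started from the home directory, where no urls directory exists. A pre-flight check along the following lines could catch this before Nutch starts; the function name check_seed_dir is hypothetical and the demonstration uses a scratch directory.

```shell
#!/bin/sh
# Hypothetical pre-flight check: verify that the seed directory exists
# relative to the directory the crawl will be started from.
check_seed_dir() {
    work_dir="$1"
    seed_dir="${2:-urls}"
    if [ -d "$work_dir/$seed_dir" ]; then
        echo "ok: $work_dir/$seed_dir"
    else
        echo "missing: $work_dir/$seed_dir" >&2
        return 1
    fi
}

# Demonstration: fails first, succeeds once urls/ exists.
demo="$(mktemp -d)"
check_seed_dir "$demo" || echo "would abort before calling nutch"
mkdir "$demo/urls"
check_seed_dir "$demo"
```

In practice the likely remedy is to change into the Nutch directory that contains urls before running the crawl, or to pass an absolute path for the seed directory. Whether that fully resolves the problem on this server is untested.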

t3org in a vagrant box

http://t3o-solr.dev:8080/solr/t3o/admin/
And here it comes: http://t3org.dev/
Outdated: http://typo3.dev/ Could be removed from the vagrant box.
