SINN
    



    HTTP-Server Configuration

    It is not necessary to run any http-server on the harvest-machine.

    The best approach is to run the Harvest CGI scripts on a separate machine (e.g. the web server of your institution). nph-search.cgi handles all communication between the web server and the Harvest Broker. For this it needs a file called "Brokers.cf" in the following format:

    #NAME          HOST                              PORT
    Physics        harvest.physik.uni-oldenburg.de   8501
    UniOldenburg   harvest.physik.uni-oldenburg.de   8531
    ...
    
    All files needed for this are located in $HARVEST_HOME/cgi-bin and its subdirectory "lib".
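
    To illustrate the Brokers.cf format, the following is a minimal sketch of how a script could look up the host and port of a Broker by name. The function name broker_lookup is hypothetical and not part of Harvest, and the sketch assumes that the columns are separated by whitespace and that lines starting with "#" are comments.

```shell
#!/bin/sh
# Hypothetical helper (not part of Harvest): print "HOST PORT" for a Broker name.
# Assumes whitespace-separated columns NAME HOST PORT and '#' comment lines.
broker_lookup() {
    # $1 = Broker name, $2 = path to Brokers.cf
    awk -v name="$1" '
        /^[[:space:]]*#/ { next }                  # skip the header and other comments
        $1 == name { print $2, $3; found = 1; exit }
        END { exit found ? 0 : 1 }                 # non-zero status if the name is missing
    ' "$2"
}

# Example call (the path is a placeholder):
#   broker_lookup Physics /path/to/cgi-bin/Brokers.cf
```

    The function returns a non-zero exit status when the name is not listed, so a calling script can detect a Broker that has not been added to the file yet.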


    As an overview, this is how PhysNet spreads the programs over several machines. The architecture shown is fast enough to answer 100 queries per minute on an index of approx. 1 million objects.

    RunHarvest

    Harvest should be started now.
    The easiest way to do this is by calling $HARVEST_HOME/RunHarvest.

    This program asks the user a series of questions about the environment in which Harvest should run; it then starts a Gatherer and a Broker.

    The program asks:

    • On which host is the WWW server running? (any answer will do; it is not used later)
    • On which port is the WWW server running? (same)
    • What should be indexed?
    • The name of the Harvest instance to be started.
    • Directory in which the Gatherer should be installed.
    • On which port should Gatherer run?
    • Directory in which the Broker should be installed.
    • On which port should Broker run?
    • Password for the Broker
    • E-Mail address
    After this, the Gatherer and the Broker will be started automatically.

    In practice, this automatic start is inconvenient.
    The processes should therefore be killed again immediately:

       ps -ef | grep harvest

      Find the PIDs of the processes gatherd, broker and glimpseserver.

      Remove these processes with

      kill -9 (PID#)

    It is better to restart these processes manually later, once the newly created Gatherer and Broker have been configured individually.
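
    The manual steps above can also be scripted. The following is only a sketch: the function name harvest_pids is hypothetical, and it assumes the usual "ps -ef" layout (PID in the second column) and that the command lines of the Harvest processes contain the string "harvest", as the grep above implies.

```shell
#!/bin/sh
# Sketch: pick the PIDs of the Harvest daemons out of "ps -ef" output.
# Assumes the PID is the second column and that the daemons' command
# lines contain "harvest" (e.g. via the installation path).
harvest_pids() {
    # Reads ps output on stdin and prints one PID per matching line.
    # Lines of the filtering tools themselves (awk, grep) are skipped.
    awk '/harvest/ && /gatherd|broker|glimpseserver/ && !/awk|grep/ { print $2 }'
}

# Example use, equivalent to the manual kill shown above:
#   kill -9 $(ps -ef | harvest_pids)
```

    Requiring "harvest" in the line keeps unrelated processes whose name merely contains "broker" out of the list; it is still worth checking the printed PIDs before passing them to kill.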


    Gatherer Configuration

    The Gatherer collects the data located on the WWW servers.
    For this reason the Gatherer should be configured carefully, especially so that it does not index data that should not be collected.

    RunHarvest creates a .cf file in the Gatherer directory. The Gatherer is configured by editing this file.

    A detailed list of the options, with examples, is found in the Harvest manual; also have a look at the examples in

    $HARVEST_HOME/gatherers/example-...

    Attention:
    The biggest danger is that one of the Gatherers runs out of control. To prevent this, always keep in mind how a Gatherer works: it starts on a WWW page and searches it. When it finds a link on this page, it follows it. If the Gatherer is configured inconsistently, it ends up on sites where it has no business searching.

    Example: The Gatherer lands on a private homepage. There is a link to a search engine (e.g. Yahoo). So it jumps from this page to Yahoo and keeps on running...

    That is why it is important to configure the Gatherer carefully; think about exactly what should be indexed.

    An example hostfilter could be:

    Deny xxx
    Deny arXiv.org
    Deny ojps.aip.org
    Deny www.adobe.com
    Deny www.yahoo.de
    Deny www.w3.org
    Deny www.slac.stanford.edu
    Deny lycos
    Deny .com
    Allow .*
    
    This is nearly the default of one of the PhysDep gatherers.
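
    To reason about what such a filter lets through, it can help to replay it against a few host names. The sketch below is an assumption about the evaluation, not Harvest's own code: it treats every pattern as a regular expression and lets the first matching line decide; the Harvest manual is the authority on the exact rules. The function name host_allowed is hypothetical.

```shell
#!/bin/sh
# Sketch of first-match evaluation of Allow/Deny lines (assumed behaviour;
# consult the Harvest manual for the exact rules).
# Usage: host_allowed HOST FILTERFILE
# Prints "Allow" or "Deny" for the first matching line, "NoMatch" otherwise.
host_allowed() {
    host=$1
    while read -r action pattern; do
        case $action in
            Allow|Deny) ;;
            *) continue ;;          # ignore blank lines and anything else
        esac
        # Patterns are treated as regular expressions, so "." matches any character.
        if printf '%s\n' "$host" | grep -q -e "$pattern"; then
            echo "$action"
            return 0
        fi
    done < "$2"
    echo "NoMatch"
    return 1
}

# Example call (the file name is a placeholder):
#   host_allowed www.yahoo.de hostfilter.txt
```

    Under these assumptions the order of the lines matters: the final "Allow .*" only applies to hosts that none of the Deny lines above it has matched.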

    Broker Configuration

    The Broker is the part of Harvest that accesses the data collected by the Gatherer and provides the user with a query interface.
    The command RunHarvest creates a Broker automatically; however, one can create further brokers which access the same 'database' at any time - the corresponding command is CreateBroker.

    In $HARVEST_HOME/brokers/BROKER/admin/broker.conf you should change the line:
    GlimpseIndex-Flags -n
    into one that allows Glimpse to use more than the default 10 MByte of memory:
    GlimpseIndex-Flags -n -B -M 100
    With these flags indexing runs much faster, because Glimpse builds bigger tables (-B) and uses 100 MByte of memory (-M 100).
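
    The change can also be made with a small script. This is a sketch only: the function name set_glimpse_flags is hypothetical, and it assumes the flags line starts at the beginning of a line, as shown above. It keeps a backup copy with the suffix .bak.

```shell
#!/bin/sh
# Sketch: rewrite the GlimpseIndex-Flags line of a broker.conf file.
# Hypothetical helper; writes a .bak copy first and avoids "sed -i",
# which is not available in every sed.
set_glimpse_flags() {
    conf=$1      # path to broker.conf
    flags=$2     # e.g. "-n -B -M 100" (must not contain "/" or "&")
    cp "$conf" "$conf.bak" || return 1
    sed "s/^GlimpseIndex-Flags.*/GlimpseIndex-Flags $flags/" "$conf.bak" > "$conf"
}

# Example call:
#   set_glimpse_flags "$HARVEST_HOME/brokers/BROKER/admin/broker.conf" "-n -B -M 100"
```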

    There is still a limit in the glimpseindex program. Once the total size of the indexed data exceeds 1 GByte, the memory needed grows quite quickly, so experiment (-M 300 ... -M 2000 ...), but take care not to use more memory than is available (physical plus swap).
    If it still does not run and instead crashes with a core dump, it is a good idea to separate the Glimpse index from the rest of the Broker and to start it from crontab:
    $HARVEST_HOME/lib/broker/glimpseindex -b -T -B -M 1000 -n -H $HARVEST_HOME/brokers/BROKER $HARVEST_HOME/brokers/BROKER/objects

    Please do not forget to restart the glimpse-server after rebuilding the glimpse-index (just kill the process)!
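
    When choosing a value for -M as discussed above, it is useful to know how much physical memory plus swap the machine has. The following sketch assumes a Linux system that provides /proc/meminfo; the function names total_mem_mb and fits_in_memory are hypothetical.

```shell
#!/bin/sh
# Sketch: sum physical memory and swap (in MByte) from /proc/meminfo-style input.
# Linux-specific assumption; other systems report memory differently.
total_mem_mb() {
    awk '$1 == "MemTotal:" || $1 == "SwapTotal:" { kb += $2 }
         END { printf "%d\n", kb / 1024 }'
}

# Hypothetical check: succeeds if the requested -M value (in MByte) does not
# exceed physical memory plus swap.
fits_in_memory() {
    [ "$1" -le "$(total_mem_mb)" ]
}

# Example calls:
#   total_mem_mb < /proc/meminfo
#   fits_in_memory 1000 < /proc/meminfo && echo "-M 1000 fits"
```

    The check only compares against the total; memory already used by other processes on the machine is not taken into account.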

    CGI-DIR/Brokers.cf: Add every new Broker to this file manually.

    Files worth knowing:

    • $HARVEST_HOME/brokers/BROKER/admin/broker.conf: Contains the configuration of the Broker, including the password. It should be protected against unauthorized access from outside.
    • $HARVEST_HOME/brokers/BROKER/admin/Collection.conf: Lists the Gatherers and/or Brokers from which this Broker collects its data.