PubMedPDF Tools [v1.09, 2008-09-18]
===============================================

** What it is:

PubMedPDF Tools is a set of scripts for automatically managing a collection
of journal PDF reprints that are indexed by PubMed.

-------------------------------------------------------------------------------
** Description:

If most of the journals you read are indexed by PubMed (www.pubmed.gov),
you can use the scripts contained here to manage your collection of PDF
reprint files.  The scripts collate reprint files automatically into
folders based on author/year and journal/year via the Unix symbolic link(*)
or the Windows shortcut mechanism.

Users deposit reprint files in a directory, naming them like 12345678.pdf,
where 12345678 is the PubMed ID (PMID), a unique ID number that PubMed
assigns to each paper.  The scripts automatically query PubMed and create
symbolic links to these files under human-readable names based on author
names, journal name, year, etc.

For example, 11090662.pdf is automatically given a journal/year-based link from:
<-- Vision Res/2000/De Valois RL, Cottaris NP, Mahon LE, Elfar SD, Wilson JA; Spatial and temporal receptive..-11090662.pdf

It also receives multiple author/year-based links for all authors of the paper:
<-- /D/De Valois RL/2000/De Valois RL, Cottaris NP, Mahon LE, Elfar SD, Wilson JA; Spatial and temporal receptive..-Vision Res-11090662.pdf
<-- /C/Cottaris NP/2000/De Valois RL, Cottaris NP, Mahon LE, Elfar SD, Wilson JA; Spatial and temporal receptive..-Vision Res-11090662.pdf
  (... more links for the 3 additional authors)

Therefore, unlike a typical paper-based reprint collection, for which
you have to decide in whose folder to file a reprint, all authors of a given
paper are treated equally in this system; you can find a given paper if you
know the name of any one of its authors.
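
The naming scheme in the example above can be sketched in a few lines of
Python (the actual scripts are written in Perl).  This is only an
illustration: the function names, the 30-character title limit, and the
".." truncation marker are assumptions inferred from the single example
shown here, not taken from the scripts themselves.

```python
TITLE_LIMIT = 30  # assumption: inferred from the one example in this README


def short_title(title, limit=TITLE_LIMIT):
    """Truncate a long title and mark the cut with '..'."""
    return title if len(title) <= limit else title[:limit] + ".."


def link_paths(pmid, authors, title, journal, year):
    """Return (journal_link, author_links) as relative path strings."""
    base = f"{', '.join(authors)}; {short_title(title)}"
    # One link under <journal>/<year>/
    journal_link = "/".join([journal, str(year), f"{base}-{pmid}.pdf"])
    # One link per author under <initial>/<author>/<year>/
    author_links = [
        "/".join([a[0].upper(), a, str(year), f"{base}-{journal}-{pmid}.pdf"])
        for a in authors
    ]
    return journal_link, author_links


if __name__ == "__main__":
    journal_link, author_links = link_paths(
        "11090662",
        ["De Valois RL", "Cottaris NP", "Mahon LE", "Elfar SD", "Wilson JA"],
        # title abbreviated here; only its first 30 characters are used
        "Spatial and temporal receptive fields of geniculate and cortical cells",
        "Vision Res",
        2000,
    )
    print(journal_link)
    for path in author_links:
        print(path)
```

Each of the five authors gets one path, so a paper with N authors yields
N + 1 links in total, all pointing at the same PMID.pdf.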

See FAQ item 5 below for more information on adding key word search capability
to your reprint collection.

(*) A symbolic link appears and behaves like a copy of the original
file, but in reality it is a link (or a pointer) to the original file.
Therefore, it saves disk space by not duplicating the content of the original
file. (The Windows version uses "shortcuts", which work in a manner similar to
Unix symbolic links.)
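
To make the footnote concrete, here is a minimal Python sketch of a
relative symbolic link pointing at a PMID.pdf file.  The directory layout,
file names, and the function name demo_symlink are hypothetical and chosen
only for illustration; the sketch assumes a Unix-like system.

```python
import os
import tempfile


def demo_symlink():
    """Create one PDF and one relative symlink to it inside a temp directory."""
    content = b"%PDF-1.4 placeholder"
    with tempfile.TemporaryDirectory() as top:
        original = os.path.join(top, "PMID", "12345678.pdf")
        os.makedirs(os.path.dirname(original))
        with open(original, "wb") as handle:
            handle.write(content)

        link_dir = os.path.join(top, "journals", "Some Journal", "2000")
        os.makedirs(link_dir)
        link = os.path.join(link_dir, "Author A; Some title-12345678.pdf")

        # A relative target keeps the link valid if the whole tree is moved.
        target = os.path.relpath(original, start=link_dir)
        os.symlink(target, link)

        # Reading through the link returns the original file's content.
        with open(link, "rb") as handle:
            same_content = handle.read() == content
        return os.path.islink(link), os.readlink(link), same_content


if __name__ == "__main__":
    print(demo_symlink())
```

The link itself stores only the short target path, which is why many
links to one PDF cost almost no disk space.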


-------------------------------------------------------------------------------
** Download:

 http://www7.bpe.es.osaka-u.ac.jp/pubmedpdf/


-------------------------------------------------------------------------------
** Scripts:

(For the Windows versions of these scripts, see FAQ Q-3 below.)

pubmedlinknew.pl -- creates a hierarchy of symbolic links organized by
     journal/year or by author/year.  For the author/year hierarchy, one
     symbolic link is created for each author of a paper, all pointing to
     the same PMID.pdf.

updatePDFlinks -- is run via cron (see: "man 1 crontab") and calls the above
     script.  For example, set up this shell script to be run by cron like:

---- an example of crontab content ----
> crontab -l
# updates symbolic links for PMID.pdf files for journal article collection
# use different times in your setup to distribute the load on PubMed
31 5,17 * * * /bin/sh /usr/local/bin/updatePDFlinks


NOTES from PubMed at NCBI (National Center for Biotechnology Information):
Do not overload NCBI's systems. Users intending to send numerous queries and/or
retrieve large numbers of records from Entrez should comply with the following: 

  1. Run retrieval scripts on weekends or between 9 PM and 5 AM ET weekdays
      for any series of more than 100 requests.
  2. Make no more than one request every 3 seconds.
Please note the above when you initially index your collection.
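
As an illustration of the two rules quoted above, the following Python
sketch spaces requests at least 3 seconds apart and checks whether a given
time falls in the off-peak window.  The names Throttle and off_peak are
hypothetical; this is not how the Perl scripts themselves are written, and
off_peak assumes the datetime passed in is already in US Eastern Time.

```python
import time
from datetime import datetime


class Throttle:
    """Make successive wait() calls return at least min_interval seconds apart."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def off_peak(now_et):
    """True on weekends, or between 9 PM and 5 AM (now_et is Eastern Time)."""
    return now_et.weekday() >= 5 or now_et.hour >= 21 or now_et.hour < 5


# Typical use (fetch_record is a placeholder for your own lookup code):
#     throttle = Throttle()
#     for pmid in pmids:
#         throttle.wait()
#         fetch_record(pmid)
```

The clock and sleep functions are parameters only so that the spacing
logic can be checked without actually waiting.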


getNewFromXxxxLab -- is an optional script for multi-machine
     PDF collection mirroring.  It fetches PDF files that are not yet in your
     collection from another lab's web server.
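
The core of such one-way mirroring is deciding which files are missing
locally.  The Python sketch below shows one way to do that from a web
server's directory listing page.  It is an assumption-laden illustration
(the function name and the href pattern are hypothetical), not the logic
of getNewFromXxxxLab itself, which relies on wget.

```python
import re

# Matches links such as href="11090662.pdf" in a directory listing page.
PDF_HREF = re.compile(r'href="(\d+\.pdf)"')


def missing_pdfs(index_html, local_names):
    """Return sorted PMID.pdf names listed remotely but absent locally."""
    remote = set(PDF_HREF.findall(index_html))
    return sorted(remote - set(local_names))


if __name__ == "__main__":
    listing = (
        '<a href="?C=N;O=D">Name</a>'
        '<a href="11090662.pdf">11090662.pdf</a>'
        '<a href="12345678.pdf">12345678.pdf</a>'
    )
    print(missing_pdfs(listing, ["12345678.pdf"]))
```

In practice local_names would come from listing your own PMID directory,
and only the returned names would need to be downloaded.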


-------------------------------------------------------------------------------
** System Requirements:

(1) Unix with filesystems that have symbolic links.
(2) Perl with LWP::Simple and XML::DOM modules installed.
(3) Access to PubMed.
(4) Cron to run scripts automatically.
Recommended additional options:
(5) SAMBA to make the PMID directory accessible from Windows clients.
(6) Web server (e.g. Apache) set up to allow HTTP access to the PMID directory and symbolic links.


-------------------------------------------------------------------------------
** Installation

(For installing the Windows version, please see additional notes in the
Windows directory.)

[1] Create a directory named "PMID" to store your PDF reprint files.  Also
    create the top-level directories "journals" and "authors" for containing
    subdirectories bearing names of journals and authors, respectively.
[2] Install LWP::Simple and XML::DOM modules for Perl (if not already installed) as follows.
     % sudo perl -MCPAN -e shell
     cpan> install LWP::Simple
     cpan> install XML::DOM
     (Expat XML Parser library, http://sourceforge.net/projects/expat/, may have
      to be installed if not already.  Also, see below if the Perl module
      installations fail.)

[3] Edit pubmedlinknew.pl and the updatePDFlinks shell script to reflect your setup.
[4] Copy pubmedlinknew.pl and updatePDFlinks to /usr/local/bin/.
[5] Run "crontab -e" and set up automatic execution of the updatePDFlinks script.
[6] Export the PMID and symbolic link directories via SAMBA to allow access from
    Windows clients.  Client users should have read/write access to the PMID directory,
    but read-only access to the symbolic links.
[7] Set up Apache or another web server to allow web access to the PMID directory
    and the symbolic links.

** Optional Installation

The following steps are optional and needed only if you wish to install
the multi-lab collection mirroring script, getNewFromXxxxLab:

[8] Edit the getNewFromXxxxLab script to reflect your setup, and copy it to
    /usr/local/bin/.
[9] Obtain and install the "wget" package from:
    http://www.gnu.org/software/wget/wget.html
[10] Create a special user, say "xxxx_lab".
[11] Set up the getNewFromXxxxLab script to be run as user xxxx_lab.

-------------------------------------------------------------------------------
** Possible Problems:


-------------------------------------------------------------------------------
** Tips for Apache Configuration

Symbolic links tend to have long names for the PDF files linked, especially if the
paper has many authors.  It may be helpful to add the following to the Apache
configuration file to widen the filename field.
---- an example of httpd.conf for apache: auto indexing options ------

# 
# IndexOptions: Controls the appearance of server-generated directory
# listings.
#  
IndexOptions FancyIndexing VersionSort SuppressHTMLPreamble
IndexOptions NameWidth=90 SuppressLastModified SuppressDescription


-------------------------------------------------------------------------------
** FAQ

Q-1: Why do you use PMID instead of DOI (Digital Object Identifier) which
most journals have now adopted?  ( See http://www.doi.org/ )

A-1: DOI was created in 1998.  Older papers did not appear to
have DOIs assigned as of Sep. 2003, whereas PMIDs exist for nearly
all important papers since the 1950s.  Things may change in the future
when all old papers are assigned DOIs.

Note also that PubMed data (XML) contains DOI for recent records.
Therefore, given a PMID, one can obtain the corresponding DOI easily
if it exists.  I am not sure if the reverse is possible or easy.
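
As a rough illustration of the PMID-to-DOI direction, the Python sketch
below pulls the DOI out of a PubMed XML record, assuming the DOI is stored
in an ArticleId element with IdType="doi".  The sample record is a
hand-written, heavily trimmed stand-in with a made-up DOI, not real PubMed
output, and the function name extract_doi is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; the DOI value is made up for illustration.
SAMPLE_XML = """<PubmedArticleSet><PubmedArticle>
<MedlineCitation><PMID>12345678</PMID></MedlineCitation>
<PubmedData><ArticleIdList>
<ArticleId IdType="pubmed">12345678</ArticleId>
<ArticleId IdType="doi">10.1000/example.doi</ArticleId>
</ArticleIdList></PubmedData>
</PubmedArticle></PubmedArticleSet>"""


def extract_doi(xml_text):
    """Return the DOI found in a PubMed XML record, or None if there is none."""
    root = ET.fromstring(xml_text)
    node = root.find(".//ArticleId[@IdType='doi']")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


if __name__ == "__main__":
    print(extract_doi(SAMPLE_XML))
```

Records without a DOI simply yield None, which matches the "if it exists"
caveat above.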


Q-2: How do the pubmedlinkxml*.pl scripts work?

A-2: These scripts are written in Perl. Each of the scripts will read the PMID
directory, and for each PDF file found therein, it will do the following.

 (1) Check if <PMID>.xml (XML data file downloaded from PubMed) already exists
     in the .cache directory within the top directory for symlinks.  If it
     already exists for <PMID>, the script assumes that symlinks for that
     paper have already been created, and goes on to processing the next PDF file.

 (2) If <PMID>.xml does not exist in .cache, it will access PubMed and
     obtain the XML-formatted data for <PMID>.

 (3) It parses the XML data and extracts all the relevant information, such
     as the journal name, author names, year of publication, volume number,
     and page range.  If the key data fields are incomplete, it will delete
     the <PMID>.xml from the .cache directory and go on to the next PDF file.
     (This ensures a retry each time the script is run, until it finds complete
      data.)

 (4) It will create the necessary directories (with author name or journal name, and
     the year), and then create symbolic links.  If there are multiple authors
     for a given article, links to that PDF file will be created for each author.

 (5) It will output information for logging to standard output.  This includes
     the PMID, the file's owner (the person who deposited the PDF file), and the
     pathnames of the symbolic links created.
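
The five steps above can be sketched in Python as follows.  This is a
simplified illustration, not a translation of the Perl scripts.  Assumed
here: a single top directory holding "journals", "authors" and ".cache";
the XML element names MedlineTA, PubDate/Year, ArticleTitle and Author;
the 30-character title limit; and a caller-supplied fetch_xml(pmid)
function that returns the XML text for one PMID.

```python
import os
import xml.etree.ElementTree as ET


def parse_record(xml_text):
    """Return (journal, year, authors, title), or None if a key field is missing."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    journal = root.findtext(".//MedlineTA")
    year = root.findtext(".//PubDate/Year")
    title = root.findtext(".//ArticleTitle")
    authors = []
    for author in root.iter("Author"):
        last = author.findtext("LastName")
        initials = author.findtext("Initials")
        if last:
            authors.append(f"{last} {initials}" if initials else last)
    if not (journal and year and title and authors):
        return None
    return journal, year, authors, title


def safe(name):
    """Keep a metadata string from introducing extra path components."""
    return name.replace("/", "_")


def update_links(pmid_dir, link_top, fetch_xml, log=print):
    cache = os.path.join(link_top, ".cache")
    os.makedirs(cache, exist_ok=True)
    for name in sorted(os.listdir(pmid_dir)):
        pmid, ext = os.path.splitext(name)
        if ext.lower() != ".pdf" or not pmid.isdigit():
            continue
        cached = os.path.join(cache, pmid + ".xml")
        if os.path.exists(cached):           # step 1: already processed
            continue
        xml_text = fetch_xml(pmid)           # step 2: look up PubMed
        with open(cached, "w", encoding="utf-8") as handle:
            handle.write(xml_text)
        record = parse_record(xml_text)      # step 3: extract key fields
        if record is None:
            os.remove(cached)                # incomplete: retry on next run
            continue
        journal, year, authors, title = record
        short = title if len(title) <= 30 else title[:30] + ".."
        base = safe(f"{', '.join(authors)}; {short}")
        targets = [(os.path.join(link_top, "journals", safe(journal), year),
                    f"{base}-{pmid}.pdf")]
        for a in authors:
            targets.append((os.path.join(link_top, "authors", a[0].upper(), safe(a), year),
                            f"{base}-{safe(journal)}-{pmid}.pdf"))
        pdf = os.path.join(pmid_dir, name)
        for folder, link_name in targets:    # step 4: directories and links
            os.makedirs(folder, exist_ok=True)
            link = os.path.join(folder, link_name)
            if not os.path.lexists(link):
                os.symlink(os.path.relpath(pdf, start=folder), link)
            # step 5: log the PMID, the owner's uid, and the link path
            log(f"{pmid} owner-uid={os.stat(pdf).st_uid} -> {link}")
```

Passing fetch_xml in as a parameter keeps the network access separate, so
the cache-and-retry behaviour can be tried out with canned XML before
pointing it at PubMed.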


Q-3: Windows version?

A-3: The Windows version is obsolete and no longer maintained.  If you need it,
use the old pmxmlwinAU.pl or pmxmlwinJ.pl with ActivePerl 5.8.4.810 for Windows
(downloadable from www.activestate.com), and install LWP::Simple and
XML::DOM using the Perl Package Manager.
Please see the Windows-specific installation notes in the Windows directory.


Q-4: What OS have the scripts been tested on?

A-4: FreeBSD (4.7), Debian GNU/Linux, MacOS X.
     The usual difficulty is installing the two Perl modules. If these
     can be installed, the scripts should work on most Unix-like systems.
     Also, the Windows version has been tested on Windows XP Pro (SP2).


Q-5: getNewFromXxxxLab script syncs only in one direction. Is it possible to
sync in both directions?

A-5: The getNewFromXxxxLab script is based on HTTP (using Wget), so you do
not need to log into the machine acting as the source.
Bidirectional sync is possible with rsync, but it requires authentication/login with
the counterpart server.

-------------------------------------------------------------------------------
** Authors and Contributors

Toshihiro Aoyama (RIKEN): author of pubmedlink* scripts.
Tomoyuki Naito (Osaka Univ): author of getNewFromXxxxLab script.
Izumi Ohzawa (Osaka Univ): instigator who begged for a script like pubmedlink*,
                           added some improvements, and wrote this note.
Cameron J. Morland: relative symbolic links; merged the scripts into a single
                   script with an option switch; sanity checks; path length limit.
Philipp Sasse (University of Bonn): Windows version - converted the script for use with Perl for Windows

** Correspondence and improvements should be sent to:
   pubmedpdf@fbs.osaka-u.ac.jp

** Acknowledgements
Ideas for these scripts and the use of PubMed ID for managing reprint
collections have been developed from discussions for improving
Visiome Platform (http://platform.visiome.org/ ), created by Neuroinformatics
for Research in Vision (NRV) Project.  NRV Project has been funded by a special
coordination fund for promoting science from the Ministry of Education, Culture,
Sports, Science and Technology (MEXT) during fiscal 1999-2003.

-------------------------------------------------------------------------------
