package Augment;

# ====================================================================
# The Apache Software License, Version 1.1
#
# Copyright (c) 2000 Bootstrap Institute.  All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in
#    the documentation and/or other materials provided with the
#    distribution.
#
# 3. The end-user documentation included with the redistribution,
#    if any, must include the following acknowledgment:
#       "This product includes software developed by the
#        Bootstrap Institute (http://www.bootstrap.org/)."
#    Alternately, this acknowledgment may appear in the software itself,
#    if and wherever such third-party acknowledgments normally appear.
#
# 4. The names "Open Hyperdocument System" and "Bootstrap Institute"
#    must not be used to endorse or promote products derived from 
#    this software without prior written permission. For written
#    permission, please contact info@bootstrap.org.
#
# 5. Products derived from this software may not be called "Open
#    Hyperdocument System", nor may "Open Hyperdocument System"
#    appear in their name, without prior written permission of the
#    Bootstrap Institute.
#
# THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED.  IN NO EVENT SHALL THE BOOTSTRAP INSTITUTE OR ITS
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
# ====================================================================
#
# This software consists of voluntary contributions made by many
# individuals on behalf of the Bootstrap Institute.  For more
# information on the Bootstrap Institute, please see
# <http://www.bootstrap.org/>.

=head1 NAME

Augment.pm - class for accessing/converting Augment files

=head1 SYNOPSIS

  use Augment;

  my $doc = Augment->new('file.txt');
  $doc->writeHTML;
  print "file.txt converted and written to " . $doc->getHTMLfilename . "\n";
  $doc->destroy;

=head1 DESCRIPTION

Parses an exported Augment file (generated by Doug Engelbart from his
Augment system) into a Perl object for conversion into other formats.
Currently, HTML is the only supported output format.

=head1 EXPORTED AUGMENT FILE

The exported Augment file contains formatted ASCII text that is
automatically generated by one of Doug's Augment scripts, with some
manual tweaking here and there.  Doug's script does the following.

First, it includes the Augment filename and the date of the file in
the first and third lines of the exported file, respectively.
Directories are delimited by commas, with the exception of the
trailing comma, which can be used to add addressing information.  For
example:

    AUGMENT,132505,

refers to the file 132505 in the AUGMENT directory.

Second, it formats Augment addressing information (hierarchical
address, statement ID (SID), and optional label) and prepends it to
the statement.  The resulting string is delimited by colons, and is
separated from the statement by a pound sign.  For example:

    12B:0177:Reference-2#Reference-2: A reference.

The string preceding the pound sign is the address string, with a
hierarchical address of 12B, a statement ID of 0177, and the label
"Reference-2".

Hierarchical addresses could easily be computed by a2h.pl, but since
that information is already in Augment, we figured we might as well
take advantage of it.  Augment statement IDs are always prefixed by a
zero.  The third field, the label, is optional.
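
The following is a minimal sketch of how such a line can be split into
its fields.  The helper name parse_address_line is hypothetical and not
part of Augment.pm, and the pattern is a stricter variant of the one
used in _parse_file, assuming the SID always starts with a zero.

```perl

    use strict;
    use warnings;

    # Hypothetical helper (not part of Augment.pm): split one exported
    # line into its address fields and the text that follows the '#'.
    sub parse_address_line {
        my ($line) = @_;
        # address : SID (leading zero) : optional label # statement text
        return unless $line =~ /^\s*([^:\s]+):(0[0-9]+):([^#]*)#(.*)$/;
        return {
            address => $1,
            sid     => $2,
            label   => $3,
            data    => $4,
        };
    }

    my $fields = parse_address_line(
        '12B:0177:Reference-2#Reference-2: A reference.');
    print join('|', @{$fields}{qw(address sid label data)}), "\n";

```

For the sample line this prints
"12B|0177|Reference-2|Reference-2: A reference.".  A line without an
address string makes the helper return nothing.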

Third, it indents statements by multiples of three spaces and
word-wraps them at 80 columns.  It typically separates statements by a
blank line.  However, Augment does not treat newlines as statement
delimiters.  In other words, you cannot assume that paragraphs
separated by blank lines are unique statements.  Because of this
behavior, the only way to identify a unique statement in an exported
Augment file is by the address string that precedes it.
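
As a rough illustration of the rule above, the sketch below treats
only lines that carry an address string as the start of a statement,
derives the nesting depth from the indentation, and folds every other
non-blank line into the current statement.  The sample lines and the
variable names are assumptions made for this example.

```perl

    use strict;
    use warnings;

    # Sample lines as they might appear in an exported file (assumed).
    my @lines = (
        '1:02:Introduction#Introduction',
        '',
        '   1A:03:#This is the introductory paragraph,',
        '   wrapped onto a second line.',
    );

    my @found;
    for my $line (@lines) {
        if ($line =~ /^(\s*)[^:\s]+:0[0-9]+:[^#]*#(.*)$/) {
            # An address string starts a new statement; three spaces
            # of indentation correspond to one level of depth.
            push @found, { level => length($1) / 3, text => $2 };
        }
        elsif (@found && $line =~ /\S/) {
            # Anything else continues the current statement.
            (my $more = $line) =~ s/^\s+//;
            $found[-1]{text} .= " $more";
        }
    }
    printf "%d: %s\n", $_->{level}, $_->{text} for @found;

```

Note that the blank line between the two statements plays no role in
deciding where a statement begins.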

Some Augment files include directives meant primarily for printed
output.  They generally look something like:

    .SnfShow=Off;

Doug's script is supposed to turn these off before outputting text,
but it doesn't look like he's made that modification yet, so this
script tries to identify and strip these directives itself.
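
The substitution used for this in _parse_file can be tried on its own,
as in the sketch below.  The wrapper name strip_directives is
hypothetical; only the regular expression comes from this module.

```perl

    use strict;
    use warnings;

    # Same substitution that _parse_file applies, wrapped in a
    # hypothetical helper so it can be tried in isolation.
    sub strip_directives {
        my ($text) = @_;
        $text =~ s/\.[^=]+=[^;]+;//g;
        return $text;
    }

    print strip_directives('.SnfShow=Off;Visible text'), "\n";

    # Caveat: the match starts at the first period before the '=', so
    # a sentence-ending period earlier on the line is swallowed too.
    print strip_directives('See sec. 2. .SnfShow=Off; Done.'), "\n";

```

The first call prints "Visible text"; the second prints
"See sec Done.", which shows that the pattern can remove more than the
directive when ordinary text precedes it.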

This is what an exported Augment file might look like:

    0:01:#Sample Augment file

    1:02:Introduction#Introduction

       1A:03:#This is the introductory paragraph.

       1B:04:#This is the second paragraph.

=head1 DATA STRUCTURE

Augment.pm parses the file into a hash that stores some metadata and a
reference to the parse tree.  The keys to the hash are:

    'filename' => Augment filename,
    'date' => publication date,
    'statements' => reference to parse tree

The parse tree consists of a reference to an array of references
pointing either to a statement hash or to another array of references.
The statement hash contains the following:

    $statement = {
        'address' => hierarchical address,
        'sid' => statement ID,
        'label' => optional label,
        'data' => statement data
    };

The following is a populated $statements data structure corresponding
to the sample exported Augment file shown above.

    $statements = [
        { 'address' => '0',
          'sid' => '01',
          'label' => '',
          'data' => 'Sample Augment file'
        },
        { 'address' => '1',
          'sid' => '02',
          'label' => 'Introduction',
          'data' => 'Introduction'
        },
        [
            { 'address' => '1A',
              'sid' => '03',
              'label' => '',
              'data' => 'This is the introductory paragraph.'
            },
            { 'address' => '1B',
              'sid' => '04',
              'label' => '',
              'data' => 'This is the second paragraph.'
            },
        ]
    ];
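
Because statements and nested branches share one array, code that
reads the tree has to check each element's type.  The sketch below
shows one way to do that; walk_statements is a hypothetical helper and
not a method of this class.

```perl

    use strict;
    use warnings;

    # Parse tree for the sample file shown above.
    my $statements = [
        { address => '0', sid => '01', label => '',
          data => 'Sample Augment file' },
        { address => '1', sid => '02', label => 'Introduction',
          data => 'Introduction' },
        [
            { address => '1A', sid => '03', label => '',
              data => 'This is the introductory paragraph.' },
            { address => '1B', sid => '04', label => '',
              data => 'This is the second paragraph.' },
        ],
    ];

    # Hypothetical helper: call $visit->($statement, $depth) for every
    # statement hash, descending into nested array references.
    sub walk_statements {
        my ($nodes, $visit, $depth) = @_;
        $depth = 0 unless defined $depth;
        for my $node (@{$nodes}) {
            if (ref $node eq 'ARRAY') {
                walk_statements($node, $visit, $depth + 1);
            }
            else {
                $visit->($node, $depth);
            }
        }
    }

    my @outline;
    walk_statements($statements, sub {
        my ($statement, $depth) = @_;
        push @outline, ('   ' x $depth) . $statement->{address};
    });
    print "$_\n" for @outline;

```

This prints the addresses 0, 1, 1A, and 1B, with the last two indented
by three spaces.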

=cut

use IO::File;
use strict;

### configuration variables
my $_HTML_FILES_DIR = '/home/eekim/www/apache/htdocs/augment';


### public methods

=head1 METHODS

=head2 new($fname, %options)

 pre:
   $fname - name of exported Augment file
   $options{'html_files_dir'} - directory where converted HTML files go.
       If not specified, set to $_HTML_FILES_DIR.

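 post:
   $self - blessed Augment object holding the parse tree and metadata.

Augment constructor.  Reads an exported Augment file, and parses it
into an internal data structure.  Note that html_files_dir is stored in
a file-scoped variable, so it applies to every Augment object in the
process, and it is silently ignored if the directory does not exist.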

=cut

sub new {
  my $this = shift;
  my ($fname, %options) = @_;

  if ($options{'html_files_dir'} && -d $options{'html_files_dir'}) {
    $_HTML_FILES_DIR = $options{'html_files_dir'};
  }
  my $self = _parse_file($fname);
  bless($self, $this);
  return $self;
}

=head2 writeHTML

 pre:
   none

 post:
   none

Generates HTML from the parse tree, converts the Augment filename to
an HTML filename, and writes the HTML file under the HTML files
directory, creating subdirectories as needed.

=cut

sub writeHTML {
  my $this = shift;
  my $c_fname = _convert_filename($this->{'filename'});

  _create_subdirectories($c_fname);
  my $fh = new IO::File ">$_HTML_FILES_DIR/$c_fname";
  &_html_header($fh, $this->{'filename'}, $this->{'date'});
  &_statements_to_html($this->{'statements'}, $fh, 1);
  &_html_footer($fh);
  $fh->close;
}

=head2 getHTMLfilename

 pre:
   none

 post:
   $c_fname - converted filename, relative to the HTML files directory.

Returns the corresponding HTML filename of an Augment file.

=cut

sub getHTMLfilename {
  my $this = shift;

  return _convert_filename($this->{'filename'});
}

=head2 destroy

 pre:
   none

 post:
   none

Augment destructor.  Clears this method's local copy of the object
reference.  Perl frees the object itself once the caller's last
reference to it goes out of scope, so calling destroy is optional.

=cut

sub destroy {
  my $this = shift;

  undef $this;
}

### private methods

=head1 PRIVATE METHODS

=head2 _parse_file($input_file)

 pre:
   $input_file - name of exported Augment file to convert.

 post:
   \%parse_tree - statements + metadata (described above).

Parses the input file and returns a reference to a hash that holds the
file's metadata and the parse tree built from its statements.

=cut

sub _parse_file {
  my $input_file = shift;
  my ($line, $location_string, $current_indent_level, $indent_level);
  my ($statements, $statement, %parse_tree, $current_branch, @branches);

  my $fh = IO::File->new($input_file, 'r')
    or die "Cannot open $input_file: $!\n";
  # read metadata in first few lines
  $parse_tree{'filename'} = <$fh>;
  <$fh>; # ignore blank line
  $parse_tree{'date'} = <$fh>;
  <$fh>; # ignore blank line
  # parse main data into @statements
  $current_indent_level = 0;
  $statements = [];
  $current_branch = $statements;
  push @branches, $current_branch;
  undef $statement;
  # Files the pending $statement into the tree at depth $indent_level,
  # opening or closing branches as needed.  Used both when a new
  # statement starts and for the final statement in the file.
  my $insert_statement = sub {
    return unless defined $statement;
    ### BEGIN filtering rules for $statement->{'data'}
    # remove Augment directives
    $statement->{'data'} =~ s/\.[^=]+=[^\;]+\;//g;
    # search and replace HTML entities
    $statement->{'data'} =~ s/\&/\&amp\;/g;
    $statement->{'data'} =~ s/</\&lt\;/g;
    $statement->{'data'} =~ s/>/\&gt\;/g;
    # possibly bold first sentence
    $statement->{'data'} =~ s/^([A-Z\s\-]+\.)/<b>$1<\/b>/;
    ### END filtering rules for $statement->{'data'}
    # descend: open one nested branch per extra level of indentation
    while ($indent_level > $current_indent_level) {
      push @branches, $current_branch;
      push @{$current_branch}, [];
      $current_branch = $current_branch->[$#$current_branch];
      $current_indent_level++;
    }
    # ascend: return to the enclosing branch for each level closed
    while ($indent_level < $current_indent_level) {
      $current_branch = pop @branches;
      $current_indent_level--;
    }
    push @{$current_branch}, $statement;
    undef $statement;
  };
  while ($line = <$fh>) {
    # move statement metadata to $location_string, if it exists
    if ($line =~ s/(^\s*.+:0[0-9]+:[^\#]*\#)//) {
      $location_string = $1;
    }
    else {
      undef $location_string;
    }
    if ($location_string) { # first line of new statement
      # insert previous statement into appropriate place in @statements
      $insert_statement->();
      # parse metadata
      $location_string =~ s/\#$//;
      $location_string =~ s/(^\s*)//;
      ### Indent_level can be determined either by parsing the hierarchical address,
      ### or by measuring the number of spaces the statement is indented.  I chose
      ### the latter method, because I'm lazy.
      $indent_level = int(length($1) / 3);
      ($statement->{'address'}, $statement->{'sid'}, $statement->{'label'}) =
        split(/:/, $location_string);
    }
    # clean up $line and append it to $statement->{'data'}
    $line =~ s/^\s*//; # delete all whitespace at beginning of line
    $line =~ s/\s*$/ /; # replace all whitespace at end of line with one space
    $statement->{'data'} .= $line;
  }
  # add last $statement to $statements, at its own indentation level
  $insert_statement->();
  $fh->close;
  $parse_tree{'statements'} = $statements;
  return \%parse_tree;
}

=head2 _statements_to_html($statements, $fh, $indent_level)

 pre:
   $statements - parse tree (w/o root metadata)
   $fh - file handle to where converted HTML is printed.
   $indent_level - current indentation level.  Used to determine which
       HTML header style to print.

 post:
   none

Recursive function that traverses the parse tree and prints HTML
statements with appropriate addresses, indentation, and other special
formatting (such as the infamous "purple" numbers).
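
To make the output format concrete, the sketch below builds the markup
for a single non-heading statement as a string.  The helper name
statement_html is hypothetical, and unlike _statements_to_html it
skips empty anchor names; it is meant only to show the shape of the
generated HTML.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: build the HTML for one non-heading
    # statement as a string instead of printing it to a file handle.
    sub statement_html {
        my ($statement) = @_;
        my $html = '<p>';
        # One named anchor per way of addressing the statement.
        for my $name (grep { defined $_ && length $_ }
                      @{$statement}{qw(address sid label)}) {
            $html .= '<a name="' . $name . '"></a>';
        }
        $html .= $statement->{data};
        # The "purple number": a small self-link showing the address.
        if ($statement->{address} ne '0') {
            $html .= '<span class="citation"><a href="#'
                   . $statement->{address} . '">'
                   . $statement->{address} . '</a></span>';
        }
        return $html . '</p>';
    }

    print statement_html({
        address => '1A', sid => '03', label => '', data => 'Hi.',
    }), "\n";

```

The purple number is simply a link whose target and text are both the
statement's hierarchical address, styled by the SPAN.citation rule in
the embedded stylesheet.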

=cut

sub _statements_to_html {
  my ($statements, $fh, $indent_level) = @_;
  my ($statement);

  foreach $statement (@{$statements}) {
    if (ref $statement eq 'HASH') { # print statement
      if ($statement->{'data'} eq uc($statement->{'data'})) {
        print $fh "<h$indent_level>";
      }
      else {
        print $fh '<p>';
      }
      print $fh '<a name="' . $statement->{'address'} . '"></a>';
      print $fh '<a name="' . $statement->{'sid'} . '"></a>';
      print $fh '<a name="' . $statement->{'label'} . '"></a>'
        unless (!$statement->{'label'});
      print $fh $statement->{'data'};
      # purple numbers
      if ($statement->{'address'} ne '0') {
        print $fh '<span class="citation"><a href="#';
        print $fh $statement->{'address'} . '">';
        print $fh $statement->{'address'} . '</a></span>';
      }
      if ($statement->{'data'} eq uc($statement->{'data'})) {
        print $fh "</h$indent_level>\n\n";
      }
      else {
        print $fh '</p>' . "\n\n";
      }
    }
    elsif (ref $statement eq 'ARRAY') {
      # indent...
      print $fh '<div class="indented">' . "\n\n";
      # ... recurse...
      &_statements_to_html($statement, $fh, $indent_level + 1);
      # ... and unindent...
      print $fh '</div>' . "\n\n";
    }
  }
}

=head2 _html_header($fh, $fname, $date)

 pre:
   $fh - file handle to where converted HTML is printed.
   $fname - name of Augment file
   $date - date of Augment file

 post:
   none

Prints the HTML header tags with embedded stylesheet and other
metadata.

=cut

sub _html_header {
  my ($fh, $fname, $date) = @_;

  print $fh <<EOM;
<html>
<head>
<title>$fname -- $date</title>
<style type="text/css">
<!--
    SPAN.citation { color: #C100C1;
                    font-size: smaller;
                    font-weight: bold;
                    font-style: italic }
    DIV.indented { margin-left: 3em }
-->
</style>
</head>

<body bgcolor="#FFFFFF">
EOM
}

=head2 _html_footer($fh)

 pre:
   $fh - file handle to where converted HTML is printed.

 post:
   none

Prints the HTML footer tags.

=cut

sub _html_footer {
  my $fh = shift;
  print $fh <<EOM;
</body>
</html>
EOM
}

=head2 _convert_filename($a_fname)

 pre:
   $a_fname - original Augment filename

 post:
   $a_fname - converted filename

Converts an Augment filename to a Web/UNIX-friendly filename by
replacing the trailing comma with '.html' and all other commas with
forward slashes.
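
The rule is small enough to show in isolation.  The sketch below is a
standalone copy for illustration only; the public name
convert_filename is an assumption, since the real function is private.

```perl

    use strict;
    use warnings;

    # Standalone copy of the conversion rule, for illustration only.
    sub convert_filename {
        my ($name) = @_;
        chomp $name;
        $name =~ s/,\s*$/.html/;   # trailing comma becomes .html
        $name =~ s{,}{/}g;         # remaining commas become slashes
        return $name;
    }

    print convert_filename("AUGMENT,132505,\n"), "\n";

```

For the filename used earlier in this document, this prints
"AUGMENT/132505.html".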

=cut

sub _convert_filename {
  my $a_fname = shift;

  chomp $a_fname;
  # replace last comma with .html
  $a_fname =~ s/,\s*$/.html/;
  # replace commas with forward slashes
  $a_fname =~ s/,/\//g;
  return $a_fname;
}

=head2 _create_subdirectories($path)

 pre:
   $path - converted filename, relative to $_HTML_FILES_DIR

 post:
   none

Creates the appropriate directories if they do not already exist.
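
The same effect can probably be had with the core File::Path module.
The sketch below is an alternative under that assumption, not what
this module does; ensure_parent_dirs is a hypothetical name, and $root
stands in for the configured HTML files directory.

```perl

    use strict;
    use warnings;
    use File::Basename qw(dirname);
    use File::Path qw(make_path);
    use File::Spec;
    use File::Temp qw(tempdir);

    # Hypothetical alternative to _create_subdirectories.  $root stands
    # in for $_HTML_FILES_DIR; $rel is a converted filename.
    sub ensure_parent_dirs {
        my ($root, $rel) = @_;
        my $dir = File::Spec->catdir($root, dirname($rel));
        # make_path creates every missing intermediate directory.
        make_path($dir, { mode => 0755 }) unless -d $dir;
        return $dir;
    }

    my $root = tempdir(CLEANUP => 1);
    my $dir  = ensure_parent_dirs($root, 'AUGMENT/ARCHIVE/132505.html');
    my $status = -d $dir ? 'created' : 'failed';
    print "$status $dir\n";

```

The example writes into a temporary directory so that it can be run
without touching a real document root.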

=cut

sub _create_subdirectories {
  my $path = shift;
  my (@dirs, $dir);

  @dirs = split(/\//, $path);
  # remove filename
  pop(@dirs);
  $path = $_HTML_FILES_DIR . '/';
  foreach $dir (@dirs) {
    $path .= $dir;
    if (!-d $path) {
      mkdir($path, 0755);
    }
    $path .= '/';
  }
}

1;

=head1 TO DO

=head2 Better Exception Handling

Augment.pm does very little error handling.  This is a bad thing.

=head2 Better Object-Orientation

It might be nice to have a generic write() method and to separate the
HTML-related methods into a subclass.  Then, anytime someone wanted to
write a new conversion module, that person could just subclass
Augment, and override write().

There's not a great need for this.  It's fairly straightforward to add
new methods to the class as it currently stands.
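
For what it is worth, the subclassing idea could look roughly like the
sketch below.  The package names My::AugmentBase and My::AugmentText
are hypothetical stand-ins, and the statements are kept flat to keep
the example short; nothing here exists in Augment.pm today.

```perl

    use strict;
    use warnings;

    # Stand-in for the parsing half of Augment.pm (hypothetical name).
    package My::AugmentBase;

    sub new {
        my ($class, %args) = @_;
        return bless { statements => $args{statements} || [] }, $class;
    }

    # Subclasses are expected to replace this method.
    sub write {
        my ($self) = @_;
        die ref($self) . " must override write()\n";
    }

    # A converter only has to supply its own write().
    package My::AugmentText;
    our @ISA = ('My::AugmentBase');

    sub write {
        my ($self) = @_;
        return join '',
            map { "$_->{address} $_->{data}\n" } @{ $self->{statements} };
    }

    package main;

    my $doc = My::AugmentText->new(
        statements => [ { address => '1', data => 'Introduction' } ],
    );
    print $doc->write;

```

Here write() returns the converted text rather than writing a file,
which is a simplification made for the example.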

=head2 Generic Manipulation Methods

It might be nice to have some generic functions for manipulating the
parse tree, perhaps a traverse() method.  However, as I said before,
there's no great need for this right now; time is better spent on
other areas, especially the ones listed below.

=head2 Augment Links

This version of a2h.pl does not convert Augment links to HTML links.
This is nontrivial for a number of reasons.  Syntactically, any text
in an Augment file delimited by parentheses or angle brackets is
potentially a link.  (At some point, Doug's Augment team standardized
on angle brackets for their link format, but some documents still use
parentheses.)  In Augment, if you pointed to some text so delimited
and tried to jump to that location, Augment would go there if the text
was a valid link (i.e., an entry in the link database); otherwise,
Augment would just ignore the command.

In order to do Augment link conversion, this script should assemble
all of the legal addresses within this document and store them in a
link database.  It should then identify anything that looks like a
link, and search the database for such a link.  If that link exists,
then it should create the appropriate HTML link.
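
A very rough sketch of those two steps follows.  It assumes a flat
list of statement hashes, works on raw text before HTML entities are
escaped, and only handles links that point inside the same document.
The names %targets, link_statement, and _link_or_keep are
hypothetical.

```perl

    use strict;
    use warnings;

    # Assumed input: statement hashes like the ones _parse_file builds.
    my @statements = (
        { address => '1', sid => '02', label => 'Introduction',
          data => 'See <1A> and <Missing>.' },
        { address => '1A', sid => '03', label => '',
          data => 'Details (Introduction).' },
    );

    # Step 1: the link database.  Every address, SID, and label maps
    # to the hierarchical address of its statement.
    my %targets;
    for my $statement (@statements) {
        for my $key (grep { defined $_ && length $_ }
                     @{$statement}{qw(address sid label)}) {
            $targets{$key} = $statement->{address};
        }
    }

    # Step 2: rewrite <...> or (...) only when the delimited text is a
    # known target; anything else is left exactly as it was.
    sub _link_or_keep {
        my ($targets, $open, $name, $close) = @_;
        return $open . $name . $close unless exists $targets->{$name};
        return '<a href="#' . $targets->{$name} . '">' . $name . '</a>';
    }

    sub link_statement {
        my ($text, $targets) = @_;
        $text =~ s/([<(])([^<>()]+)([>)])/_link_or_keep($targets, $1, $2, $3)/ge;
        return $text;
    }

    print link_statement($_->{data}, \%targets), "\n" for @statements;

```

Unknown candidates such as <Missing> pass through untouched, which
mirrors Augment's behavior of ignoring a jump to an invalid link.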

An additional challenge is that Augment had a number of linking
semantics not supported by HTML links, such as sophisticated
addressing and indirect links.  These can be mapped to the XML XLink
specification fairly easily, but determining how to map these XLinks
to HTML links is a nontrivial problem.

Both of the above issues are opportunities for synergy with the main
OHS development.  For example, we could use the OHS link database
specification to generate a database of Augment links.  We could also
use the OHS XML->HTML transcoder to determine how XLink links are
converted to HTML links.  Of course, both of these components are
currently non-existent.

=head2 Improved parsing

Augment had a fairly generic markup language that did not specify
things such as headlines, lists, tables, etc.  It would be nice to
develop a more sophisticated set of rules that did a better job of
deciding whether something should be an HTML list or table.

=head2 Augment->XML

This script was developed primarily as a quick and dirty way to let
Doug post old and new Augment documents on the Web in an addressable
manner.  Eventually, this script should convert Augment files to XML,
which could then be transcoded into HTML.  Once an appropriate DTD is
developed, this should be fairly trivial, because the Augment file is
converted into an intermediate parse tree that could easily be used to
generate all sorts of output.

=head1 HISTORY

Shinya Yamada <shinya@bootstrap.org> wrote the first Augment->HTML
convertor in Java, and released it on August 20, 2000.  Doug Engelbart
<doug@bootstrap.org> made changes to his export script and suggested
improvements to Shinya's work, which led to this rewrite of the
convertor in Perl.

I released the first version of a2h.pl on October 6, 2000.  On October
9, 2000, I rewrote and released Augment.pm, an object-oriented version
of the appropriate a2h.pl functions.

=head1 AUTHOR

Eugene Eric Kim <eekim@eekim.com>

=cut