package Augment;

# ====================================================================
# The Apache Software License, Version 1.1
#
# Copyright (c) 2000 Bootstrap Institute.  All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in
#    the documentation and/or other materials provided with the
#    distribution.
#
# 3. The end-user documentation included with the redistribution,
#    if any, must include the following acknowledgment:
#       "This product includes software developed by the
#        Bootstrap Institute (http://www.bootstrap.org/)."
#    Alternately, this acknowledgment may appear in the software itself,
#    if and wherever such third-party acknowledgments normally appear.
#
# 4. The names "Open Hyperdocument System" and "Bootstrap Institute"
#    must not be used to endorse or promote products derived from
#    this software without prior written permission. For written
#    permission, please contact info@bootstrap.org.
#
# 5. Products derived from this software may not be called "Open
#    Hyperdocument System", nor may "Open Hyperdocument System"
#    appear in their name, without prior written permission of the
#    Bootstrap Institute.
#
# THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED.
# IN NO EVENT SHALL THE BOOTSTRAP INSTITUTE OR ITS
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
# ====================================================================
#
# This software consists of voluntary contributions made by many
# individuals on behalf of the Bootstrap Institute.  For more
# information on the Bootstrap Institute, please see
# .

=head1 NAME

Augment.pm - class for accessing/converting Augment files

=head1 SYNOPSIS

    use Augment;

    my $a = new Augment 'file.txt';
    $a->writeHTML;
    print "file.txt converted and written to " . $a->getHTMLfilename;
    $a->destroy;

=head1 DESCRIPTION

Parses an exported Augment file (generated by Doug Engelbart from his
Augment system) into a Perl object for conversion into other formats.
Currently, HTML is the only export format.

=head1 EXPORTED AUGMENT FILE

The exported Augment file contains formatted ASCII text that is
automatically generated by one of Doug's Augment scripts, with some
manual tweaking here and there.  Doug's script does the following.

First, it puts the Augment filename on the first line of the exported
file and the date of the file on the third line.  Directories are
delimited by commas, with the exception of the trailing comma, which
can be used to add addressing information.  For example:

    AUGMENT,132505,

refers to the file 132505 in the AUGMENT directory.

Second, it formats Augment addressing information (hierarchical
address, statement ID (SID), and optional label) and prepends it to
the statement.
The resulting string is delimited by colons, and is separated from
the statement by a pound sign.  For example:

    12B:0177:Reference-2#Reference-2: A reference.

The string preceding the pound sign is the address string, with a
hierarchical address of 12B, a statement ID of 0177, and the label
"Reference-2".  Hierarchical addresses could easily be computed by
a2h.pl, but since that information is already in Augment, we figured
we might as well take advantage of it.  Augment statement IDs are
always prefixed by a zero.  The third field, the label, is optional.

Third, it indents statements by multiples of three spaces and
word-wraps them at 80 columns.  It typically separates statements by a
blank line.  However, Augment does not treat newlines as statement
delimiters.  In other words, you cannot assume that paragraphs
separated by blank lines are unique statements.  Because of this
behavior, the only way to identify a unique statement in an exported
Augment file is by the address string that precedes it.

Some Augment files include directives meant primarily for printed
output.  They generally look something like:

    .SnfShow=Off;

Doug's script is supposed to turn these off before outputting text,
but it doesn't look like he's made that modification yet, so this
script tries to identify and strip these directives itself.

This is what an exported Augment file might look like:

    0:01:#Sample Augment file
    1:02:Introduction#Introduction
       1A:03:#This is the introductory paragraph.
       1B:04:#This is the second paragraph.

=head1 DATA STRUCTURE

a2h.pl parses the file into a hash that stores some metadata and a
reference to the parse tree.  The keys to the hash are:

    'filename'   => Augment filename,
    'date'       => publication date,
    'statements' => reference to parse tree

The parse tree consists of a reference to an array of references
pointing either to a statement hash or to another array of references.
The statement hash contains the following:

    $statement = {
        'address' => hierarchical address,
        'sid'     => statement ID,
        'label'   => optional label,
        'data'    => statement data
    };

The following is a populated $statements data structure corresponding
to the sample exported Augment file shown above.

    $statements = [
        {
            'address' => '0',
            'sid'     => '01',
            'label'   => '',
            'data'    => 'Sample Augment file'
        },
        {
            'address' => '1',
            'sid'     => '02',
            'label'   => 'Introduction',
            'data'    => 'Introduction'
        },
        [
            {
                'address' => '1A',
                'sid'     => '03',
                'label'   => '',
                'data'    => 'This is the introductory paragraph.'
            },
            {
                'address' => '1B',
                'sid'     => '04',
                'label'   => '',
                'data'    => 'This is the second paragraph.'
            },
        ]
    ];

=cut

use IO::File;
use strict;

### configuration variables

my $_HTML_FILES_DIR = '/home/eekim/www/apache/htdocs/augment';

### public methods

=head1 METHODS

=head2 new($fname, %options)

    pre:  $fname - name of exported Augment file
          $options{'html_files_dir'} - directory where converted HTML
              files go.  If not specified, set to $_HTML_FILES_DIR.
    post: none

Augment constructor.  Reads an exported Augment file, and parses it
into an internal data structure.

=cut

sub new {
    my $this = shift;
    my ($fname, %options) = @_;

    if ($options{'html_files_dir'} && -d $options{'html_files_dir'}) {
        $_HTML_FILES_DIR = $options{'html_files_dir'};
    }
    my $self = _parse_file($fname);
    bless($self, $this);
    return $self;
}

=head2 writeHTML

    pre:  none
    post: none

Generates HTML from the parse tree, converts the Augment filename
stored in the object to an HTML filename, and writes the HTML file.

=cut

sub writeHTML {
    my $this = shift;

    my $c_fname = _convert_filename($this->{'filename'});
    _create_subdirectories($c_fname);
    my $fh = new IO::File ">$_HTML_FILES_DIR/$c_fname";
    &_html_header($fh, $this->{'filename'}, $this->{'date'});
    &_statements_to_html($this->{'statements'}, $fh, 1);
    &_html_footer($fh);
    $fh->close;
}

=head2 getHTMLfilename

    pre:  none
    post: none

Returns the corresponding HTML filename of an Augment file.
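
As an illustration of the conversion that getHTMLfilename relies on,
here is a minimal standalone sketch that repeats the two substitutions
from _convert_filename.  The helper name augment_to_html_filename is
hypothetical and is not part of this module.

```perl
use strict;
use warnings;

# Hypothetical standalone helper; it mirrors the substitutions made by
# _convert_filename but is not part of Augment.pm.
sub augment_to_html_filename {
    my $a_fname = shift;
    chomp $a_fname;
    $a_fname =~ s/,\s*$/.html/;    # trailing comma becomes the extension
    $a_fname =~ s/,/\//g;          # remaining commas become slashes
    return $a_fname;
}

# prints AUGMENT/132505.html
print augment_to_html_filename("AUGMENT,132505,\n"), "\n";
```

The result is relative to the HTML files directory, so a caller still
has to prepend that directory to locate the file on disk.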

=cut

sub getHTMLfilename {
    my $this = shift;

    return _convert_filename($this->{'filename'});
}

=head2 destroy

    pre:  none
    post: none

Augment destructor.  Undefines the method's own copy of the object
reference.  Perl frees the object's memory once the caller's last
reference to it goes away.

=cut

sub destroy {
    my $this = shift;

    undef $this;
}

### private methods

=head1 PRIVATE METHODS

=head2 _parse_file($input_file)

    pre:  $input_file - name of exported Augment file to convert.
    post: \%parse_tree - statements + metadata (described above).

Parses the input file, and returns metadata about the file together
with the parse tree populated with data from the parsed file.

=cut

sub _parse_file {
    my $input_file = shift;
    my ($line, $location_string, $current_indent_level, $indent_level);
    my ($statements, $statement, %parse_tree, $current_branch, @branches);

    my $fh = new IO::File "$input_file", "r";
    # read metadata in first few lines
    $parse_tree{'filename'} = <$fh>;
    <$fh>;    # ignore blank line
    $parse_tree{'date'} = <$fh>;
    <$fh>;    # ignore blank line

    # parse main data into @statements
    $current_indent_level = 0;
    $statements = [];
    $current_branch = $statements;
    push @branches, $current_branch;
    undef $statement;
    while ($line = <$fh>) {
        # move statement metadata to $1, if it exists
        $line =~ s/(^\s*.+:0[0-9]+:[^\#]*\#)//;
        $location_string = $1;
        if ($location_string) {    # first line of new statement
            # insert previous statement into appropriate place in @statements
            if (defined $statement) {
                ### BEGIN filtering rules for $statement->{'data'}
                # remove Augment directives
                $line =~ s/\.[^=]+=[^\;]+\;//g;
                # search and replace HTML entities
                $line =~ s/\&/\&amp\;/g;
                $line =~ s/</\&lt\;/g;
                $line =~ s/>/\&gt\;/g;
                # possibly bold first sentence
                $line =~ s/^([A-Z\s\-]+\.)/<b>$1<\/b>/;
                ### END filtering rules for $statement->{'data'}
                if ($indent_level == $current_indent_level) {
                    push @{$current_branch}, $statement;
                }
                elsif ($indent_level > $current_indent_level) {
                    while ($indent_level > $current_indent_level) {
                        push @branches, $current_branch;
                        push @{$current_branch}, [];
                        $current_branch =
                            $current_branch->[$#$current_branch];
                        $current_indent_level++;
                    }
                    push @{$current_branch}, $statement;
                }
                else {
                    while ($indent_level < $current_indent_level) {
                        $current_branch = pop @branches;
                        $current_indent_level--;
                    }
                    push @{$current_branch}, $statement;
                }
                $current_indent_level = $indent_level;
                undef $statement;
            }
            # parse metadata
            $location_string =~ s/\#$//;
            $location_string =~ s/(^\s*)//;
            ### Indent_level can be determined either by parsing the
            ### hierarchical address, or by measuring the number of
            ### spaces the statement is indented.  I chose the latter
            ### method, because I'm lazy.
            $indent_level = length($1) / 3;
            ($statement->{'address'}, $statement->{'sid'},
             $statement->{'label'}) = split(/:/, $location_string);
        }
        # clean up $line and append it to $statement->{'data'}
        $line =~ s/^\s*//;     # delete all whitespace at beginning of line
        $line =~ s/\s*$/ /;    # replace all whitespace at end of line with one space
        $statement->{'data'} .= $line;
    }
    # add last $statement to $statements
    $current_branch = $statements;
    while ($current_indent_level > 0) {
        $current_branch = $current_branch->[$#$current_branch];
        $current_indent_level--;
    }
    push @{$current_branch}, $statement;
    $fh->close;

    $parse_tree{'statements'} = $statements;
    return \%parse_tree;
}

=head2 _statements_to_html($statements, $fh, $indent_level)

    pre:  $statements   - parse tree (w/o root metadata)
          $fh           - file handle to where converted HTML is printed.
          $indent_level - current indentation level.  Used to determine
                          which HTML header style to print.
    post: none

Recursive function that traverses the parse tree and prints HTML
statements with appropriate addresses, indentation, and other special
formatting (such as the infamous "purple" numbers).

=cut

sub _statements_to_html {
    my ($statements, $fh, $indent_level) = @_;
    my ($statement);

    foreach $statement (@{$statements}) {
        if (ref $statement eq 'HASH') {
            # print statement
            if ($statement->{'data'} eq uc($statement->{'data'})) {
                print $fh "";
            }
            else {
                print $fh '

';
            }
            print $fh '';
            print $fh '';
            print $fh '' unless (!$statement->{'label'});
            print $fh $statement->{'data'};
            # purple numbers
            if ($statement->{'address'} ne '0') {
                print $fh '';
                print $fh $statement->{'address'} . '';
            }
            if ($statement->{'data'} eq uc($statement->{'data'})) {
                print $fh "\n\n";
            }
            else {
                print $fh '

' . "\n\n";
            }
        }
        elsif (ref $statement eq 'ARRAY') {
            # indent...
            print $fh '
' . "\n\n";
            # ... recurse...
            &_statements_to_html($statement, $fh, $indent_level + 1);
            # ... and unindent...
            print $fh '
' . "\n\n";
        }
    }
}

=head2 _html_header($fh, $fname, $date)

    pre:  $fh    - file handle to where converted HTML is printed.
          $fname - name of Augment file
          $date  - date of Augment file
    post: none

Prints the HTML header tags with embedded stylesheet and other
metadata.

=cut

sub _html_header {
    my ($fh, $fname, $date) = @_;

    print $fh <<EOM;
$fname -- $date
EOM
}

=head2 _html_footer($fh)

    pre:  $fh - file handle to where converted HTML is printed.
    post: none

Prints the HTML footer tags.

=cut

sub _html_footer {
    my $fh = shift;

    print $fh <<EOM;
EOM
}

=head2 _convert_filename($a_fname)

    pre:  $a_fname - original Augment filename
    post: $a_fname - converted filename

Converts an Augment filename to a Web/UNIX-friendly filename by
replacing the trailing comma with '.html' and all other commas with
forward slashes.

=cut

sub _convert_filename {
    my $a_fname = shift;

    chomp $a_fname;
    # replace last comma with .html
    $a_fname =~ s/,\s*$/.html/;
    # replace commas with forward slashes
    $a_fname =~ s/,/\//g;
    return $a_fname;
}

=head2 _create_subdirectories($path)

    pre:  $path - fully qualified UNIX path and filename
    post: none

Creates the appropriate directories if they do not already exist.

=cut

sub _create_subdirectories {
    my $path = shift;
    my (@dirs, $dir);

    @dirs = split(/\//, $path);
    # remove filename
    pop(@dirs);
    $path = $_HTML_FILES_DIR . '/';
    foreach $dir (@dirs) {
        $path .= $dir;
        if (!-d $path) {
            mkdir($path, 0755);
        }
        $path .= '/';
    }
}

1;

=head1 TO DO

=head2 Better Exception Handling

Augment.pm does very little error handling.  This is a bad thing.

=head2 Better Object-Orientation

It might be nice to have a generic write() method and to separate the
HTML-related methods into a subclass.  Then, anytime someone wanted to
write a new conversion module, that person could just subclass Augment
and override write().

There's not a great need for this.  It's fairly straightforward to add
new methods to the class as it currently stands.
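
One way the generic write() idea might look is sketched below.  It is
not how Augment.pm is organized today: the package names
My::AugmentBase and My::AugmentText are hypothetical, and the
constructor takes a ready-made parse tree (in the shape shown under
DATA STRUCTURE) instead of parsing a file, so that the sketch runs on
its own.

```perl
use strict;
use warnings;

# Hypothetical base class: stores a parse tree and leaves write() to
# its subclasses.
package My::AugmentBase;

sub new {
    my ($class, $statements) = @_;
    return bless { 'statements' => $statements }, $class;
}

sub write { die ref(shift) . " must override write()\n" }

# Hypothetical converter: renders the tree as indented plain text.
package My::AugmentText;
our @ISA = ('My::AugmentBase');

sub write {
    my ($self, $node, $depth) = @_;
    $node  = $self->{'statements'} unless defined $node;
    $depth = 0 unless defined $depth;
    my $out = '';
    foreach my $item (@{$node}) {
        if (ref $item eq 'ARRAY') {
            # nested array: one level deeper in the hierarchy
            $out .= $self->write($item, $depth + 1);
        }
        else {
            $out .= ('   ' x $depth) . "$item->{'address'} $item->{'data'}\n";
        }
    }
    return $out;
}

package main;

my $doc = My::AugmentText->new([
    { 'address' => '1', 'sid' => '02', 'label' => 'Introduction',
      'data' => 'Introduction' },
    [
        { 'address' => '1A', 'sid' => '03', 'label' => '',
          'data' => 'First paragraph.' },
    ],
]);
print $doc->write;
```

A new output format would then be one more subclass with its own
write(), and the parsing code would stay untouched.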

=head2 Generic Manipulation Methods

It might be nice to have some generic functions for manipulating the
parse tree, perhaps a traverse() method.  However, as I said before,
there's no great need for this right now; time is better spent on
other areas, especially the ones listed below.

=head2 Augment Links

This version of a2h.pl does not convert Augment links to HTML links.
This is nontrivial for a number of reasons.

Syntactically, any text in an Augment file delimited by parentheses or
angle brackets is potentially a link.  (At some point, Doug's Augment
team standardized on angle brackets for their link format, but some
documents still use parentheses.)  In Augment, if you pointed to some
text so delimited and tried to jump to that location, Augment would go
there if it were a valid link (i.e., an entry in the link database);
otherwise, Augment would just ignore the command.

In order to do Augment link conversion, this script should assemble
all of the legal addresses within this document and store them in a
link database.  It should then identify anything that looks like a
link, and search the database for such a link.  If that link exists,
then it should create the appropriate HTML link.

An additional challenge is that Augment had a number of linking
semantics not supported by HTML links, such as sophisticated
addressing and indirect links.  These can be mapped to the XML XLink
specification fairly easily, but determining how to map these XLinks
to HTML links is a nontrivial problem.

Both of the above issues are opportunities for synergy with the main
OHS development.  For example, we could use the OHS link database
specification to generate a database of Augment links.  We could also
use the OHS XML->HTML transcoder to determine how XLink links are
converted to HTML links.  Of course, both of these components are
currently non-existent.

=head2 Improved parsing

Augment had a fairly generic markup language that did not specify
things such as headlines, lists, tables, etc.
It would be nice to develop a more sophisticated set of rules that
does a better job of deciding whether something should be an HTML
list or table.

=head2 Augment->XML

This script was developed primarily as a quick and dirty way to let
Doug post old and new Augment documents on the Web in an addressable
manner.  Eventually, this script should convert Augment files to XML,
which could then be transcoded into HTML.  Once an appropriate DTD is
developed, this should be fairly trivial, because the Augment file is
converted into an intermediate parse tree that could easily be used to
generate all sorts of output.

=head1 HISTORY

Shinya Yamada wrote the first Augment->HTML converter in Java, and
released it on August 20, 2000.  Doug Engelbart made changes to his
export script and suggested improvements to Shinya's work, which led
to this rewrite of the converter in Perl.  I released the first
version of a2h.pl on October 6, 2000.

On October 9, 2000, I rewrote and released Augment.pm, an
object-oriented version of the appropriate a2h.pl functions.

=head1 AUTHOR

Eugene Eric Kim

=cut
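
=head1 EXAMPLE

The link conversion outlined under "Augment Links" in the TO DO
section (collect the legal addresses, then look up anything that looks
like a link) could be approached roughly as follows.  This is only a
sketch under stated assumptions: the function names are hypothetical,
a link is assumed to name either a label or a hierarchical address in
the same document, each statement is assumed to be reachable at an
HTML anchor named after its address, and the text is assumed not to
have been entity-escaped yet.

```perl
use strict;
use warnings;

# Collect every hierarchical address and non-empty label in a parse
# tree (nested array refs of statement hashes) as a known link target.
sub build_link_db {
    my ($statements, $db) = @_;
    $db ||= {};
    foreach my $item (@{$statements}) {
        if (ref $item eq 'ARRAY') {
            build_link_db($item, $db);
            next;
        }
        $db->{ $item->{'address'} } = $item->{'address'};
        if (defined $item->{'label'} && length $item->{'label'}) {
            $db->{ $item->{'label'} } = $item->{'address'};
        }
    }
    return $db;
}

# Return an HTML link for a known target, or the original text otherwise.
# ASSUMPTION: anchors are named after the hierarchical address.
sub link_or_text {
    my ($whole, $target, $db) = @_;
    return $whole unless exists $db->{$target};
    return '<a href="#' . $db->{$target} . '">' . $target . '</a>';
}

# Treat every <...> or (...) span as a candidate link.
sub convert_links {
    my ($text, $db) = @_;
    $text =~ s{(<([^<>]+)>|\(([^()]+)\))}
              { link_or_text($1, defined $2 ? $2 : $3, $db) }ge;
    return $text;
}

my $db = build_link_db([
    { 'address' => '1', 'sid' => '02', 'label' => 'Introduction',
      'data' => 'Introduction' },
    [
        { 'address' => '1A', 'sid' => '03', 'label' => '',
          'data' => 'This is the introductory paragraph.' },
    ],
]);
print convert_links('See <Introduction> and (nowhere).', $db), "\n";
```

With the two-statement tree above, the known target is rewritten as a
link to anchor 1 and the unknown one is left as plain text, which is
the same "ignore it if it is not in the database" behavior described
for Augment itself.  Indirect links and cross-document addresses are
outside the scope of this sketch.

=cut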