Diffstat (limited to 'README.TXT')
-rw-r--r-- README.TXT 176
1 files changed, 176 insertions, 0 deletions
diff --git a/README.TXT b/README.TXT
new file mode 100644
index 0000000..f5a016a
--- /dev/null
+++ b/README.TXT
@@ -0,0 +1,176 @@
+* Contact information.
+
+Any feedback will be appreciated. You can email Daniel M. German
+<dmg@uvic.ca> and Yuki Manabe <y-manabe@ist.osaka-u.ac.jp>.
+
+* Introduction
+
+Ninka is a license identification tool that identifies the license(s)
+under which a source file is made available.
+
+This tool takes a source file as input and outputs the licenses
+identified within that file.
+
+For details on how Ninka works, please see the following
+paper:
+
+Daniel M. German, Yuki Manabe and Katsuro Inoue. A sentence-matching
+method for automatic license identification of source code files. In
+25th IEEE/ACM International Conference on Automated Software
+Engineering (ASE 2010). You can email me (dmg@uvic.ca) for a copy or
+download it from
+
+http://turingmachine.org/~dmg/papers/dmg2010ninka.pdf
+
+If you use Ninka for research purposes, we would appreciate it if you
+cited the above paper.
+
+* License
+
+ Except for the directories comments and splitter, Ninka is licensed
+ under the AGPLv3+.
+
+ Copyright (C) 2009-2010 Yuki Manabe and Daniel M. German
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Affero General Public License as
+ published by the Free Software Foundation, either version 3 of the
+ License, or (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Affero General Public License for more details.
+
+ You should have received a copy of the GNU Affero General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+ - splitter.pl is a derivative work of the Rule-based sentence
+ splitter script by Paul Clough. Please see splitter/README
+ for details.
+
+ - comments is based on a program to remove comments by Jon Newman;
+ it is released under the GNU General Public License Version 2 or
+ (at your option) any later version.
+
+* Requirements
+
+Perl version 5
+
+* How to install
+
+ 1. Unpack the distribution in a directory.
+ 2. Build and install comments (make sure it is somewhere in the
+ path) (see directory comments)
+ 3. Build splitter.pl (see splitter/README for instructions)
+
+* Usage:
+
+Ninka uses a pipe model (see below). Each step of the "pipe" creates a
+file, but the ninka.pl script runs all of the steps for you:
+
+ninka.pl [options] [filename]
+
+Available options
+ -v verbose
+ -d delete intermediate files
+ -C force creation of comments file
+ -c stop after creation of comments
+ -S force creation of sentences file
+ -s stop after creation of sentences
+ -G force creation of goodsent file
+ -g stop after creation of goodsent
+ -T force creation of senttok file
+ -t stop after creation of senttok
+ -L force creation of license file
+ -f force all processing
+
+
+Example:
+
+ ninka.pl foo.c
+
+It will create six files:
+
+ 1. foo.c.comments: contains the first two comment blocks, where
+ the license is usually found
+ 2. foo.c.sentences: contains the list of sentences in the license
+ statement
+ 3. foo.c.goodsent: contains sentences that are likely to be part of
+ a license statement
+ 4. foo.c.badsent: contains the sentences that are not part of
+ foo.c.goodsent
+ 5. foo.c.senttok: Each sentence in *.goodsent is converted into a
+ tokenized sentence (or unmatched, when none matches)
+ 6. foo.c.license: List of licenses found in the file. It contains a
+ single line with 3 fields (semicolon delimited):
+ - Licenses
+ - Sentences in *.senttok that were not matched
+
+
+
+
+* Ninka model
+
+Ninka uses a pipe-model. Each stage of the pipe does something very specific:
+
+ 1. Comment extractor.
+
+ - directory: extComments
+
+ - command: extComments.pl, might use comments (included in distribution)
+
+ - Purpose: Extracts the top comments from the source code. If no
+ comment extractor is known for the language, it extracts the top lines of the source instead (currently 700)
+
+ - Creates <filename>.comments file
+
+2. Split sentences in comments
+
+ - directory: splitter
+
+ - command: splitter.pl
+
+ - Purpose: Ninka works by matching sentences of licenses, hence
+ it needs to properly break text into sentences.
+
+ - Outputs <filename>.sentences
+
+3. Filter "good" sentences.
+
+ - directory: filter
+
+ - command: filter.pl
+
+ - Purpose: some sentences are related to a license, some are
+ not. It is valuable to know if a file contains lines that look
+ like a license or not (e.g. to know that a file has no license)
+
+ - Outputs: <filename>.goodsent, and <filename>.badsent (not used)
+
+4. Tokenizes sentences
+
+ - directory: senttok
+
+ - command: senttok.pl
+
+ - Purpose: Creates a file of the recognized sentence tokens.
+ For each sentence, it outputs the matching sentence token, or unknown if none matches.
+
+ - Outputs: <filename>.senttok
+
+5. Matches sentences to licenses
+
+ - directory: matcher
+
+ - command: matcher.pl
+
+ - Purpose: looks at the sequence of sentence tokens and outputs the licenses found
+
+ - Output: <filename>.license
+
+The script ninka.pl takes care of all these steps, optionally removes the
+intermediate files, and writes the licenses found to stdout.
+
+------
+
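
The README added by this diff says that running ninka.pl on foo.c leaves a set of files named after the input. As a rough illustration of that naming scheme only, here is a minimal Python sketch that lists those file names for a given source file. Ninka itself is written in Perl; the names NINKA_SUFFIXES and ninka_outputs are hypothetical helpers invented for this example and are not part of Ninka. Only the suffixes are taken from the README.

```python
# Hypothetical helper, not part of Ninka: it only mirrors the file
# naming described in the README (<filename>.<suffix>).

# Suffixes in the order the README lists them.
NINKA_SUFFIXES = [
    "comments",   # extracted top comment blocks
    "sentences",  # sentences split out of the comments
    "goodsent",   # sentences likely to belong to a license statement
    "badsent",    # sentences filtered out of goodsent
    "senttok",    # sentence tokens for each good sentence
    "license",    # final list of licenses found
]


def ninka_outputs(source):
    """Return the file names the README says ninka.pl creates for `source`."""
    return [f"{source}.{suffix}" for suffix in NINKA_SUFFIXES]


if __name__ == "__main__":
    for name in ninka_outputs("foo.c"):
        print(name)
```

A script wrapping ninka.pl could use such a list to check which intermediate files exist after a run, or to clean them up when the -d option was not given.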