Diffstat (limited to 'README.TXT')
-rw-r--r-- | README.TXT | 176 |
1 files changed, 176 insertions, 0 deletions
diff --git a/README.TXT b/README.TXT
new file mode 100644
index 0000000..f5a016a
--- /dev/null
+++ b/README.TXT
@@ -0,0 +1,176 @@
+* Contact information.
+
+Any feedback will be appreciated. You can email us at Daniel M. German
+<dmg@uvic.ca> and Yuki Manabe <y-manabe@ist.osaka-u.ac.jp>
+
+* Introduction
+
+Ninka is a license identification tool that identifies the license(s)
+under which a source file is made available.
+
+This tool takes a source file as input and outputs the licenses
+identified within that file.
+
+For details on how Ninka works, please see the following
+paper:
+
+Daniel M. German, Yuki Manabe and Katsuro Inoue. A sentence-matching
+method for automatic license identification of source code files. In
+25th IEEE/ACM International Conference on Automated Software
+Engineering (ASE 2010). You can email me (dmg@uvic.ca) for a copy or
+download it from
+
+http://turingmachine.org/~dmg/papers/dmg2010ninka.pdf
+
+If you use Ninka for research purposes, we would appreciate it if you
+cite the above paper.
+
+* License
+
+ Except for the directories comments and splitter, Ninka is licensed
+ under the AGPLv3+
+
+ Copyright (C) 2009-2010 Yuki Manabe and Daniel M. German
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Affero General Public License as
+ published by the Free Software Foundation, either version 3 of the
+ License, or (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Affero General Public License for more details.
+
+ You should have received a copy of the GNU Affero General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+ - splitter.pl is a derivative work of the Rule-based sentence
+   splitter script by Paul Clough. Please see splitter/README
+   for details.
+
+ - comments is based on a program to remove comments by Jon Newman.
+   It is released under the GNU General Public License Version 2 or
+   (at your option) any later version.
+
+* Requirements
+
+Perl version 5
+
+* How to install
+
+ 1. Unpack the distribution in a directory.
+ 2. Build and install comments (make sure it is somewhere in the
+    path) (see directory comments)
+ 3. Build splitter.pl (see splitter/README for instructions)
+
+* Usage:
+
+Ninka uses a pipe model (see below). Each step of the "pipe" creates a
+file, but
+
+ninka.pl [options] [filename]
+
+Available options
+  -v verbose
+  -d delete intermediate files
+  -C force creation of comments file
+  -c stop after creation of comments
+  -S force creation of sentences file
+  -s stop after creation of sentences
+  -G force creation of goodsent file
+  -g stop after creation of goodsent
+  -T force creation of senttok file
+  -t stop after creation of senttok
+  -L force creation of license file
+  -f force all processing
+
+
+Example:
+
+  ninka.pl foo.c
+
+It will create six files:
+
+ 1. foo.c.comments: contains the first two comment blocks, where
+    the license is usually found
+ 2. foo.c.sentences: contains the list of sentences in the license
+    statement
+ 3. foo.c.goodsent: contains sentences that are likely to be part of
+    a license statement
+ 4. foo.c.badsent: contains the sentences that are not part of
+    foo.c.goodsent
+ 5. foo.c.senttok: each sentence in *.goodsent is converted into a
+    sentence token (or unmatched, when none matches)
+ 6. foo.c.license: list of licenses found in the file. It contains a
+    single line with 3 fields (semicolon delimited):
+    - Licenses
+    - Sentences in *.senttok that were not matched
+
+
+
+
+* Ninka model
+
+Ninka uses a pipe model. Each stage of the pipe does something very specific:
+
+1. Comment extractor.
+
+   - directory: extComments
+
+   - command: extComments.pl, might use comments (included in distribution)
+
+   - Purpose: Extracts the top comments of the source code. If no
+     comment extractor is known for the language, it extracts the top lines of the source instead (currently 700)
+
+   - Outputs: <filename>.comments
+
+2. Split sentences in comments
+
+   - directory: splitter
+
+   - command: splitter.pl
+
+   - Purpose: Ninka works by matching sentences of licenses, hence
+     it needs to properly break text into sentences.
+
+   - Outputs: <filename>.sentences
+
+3. Filter "good" sentences.
+
+   - directory: filter
+
+   - command: filter.pl
+
+   - Purpose: some sentences are related to a license, some are
+     not. It is valuable to know whether a file contains lines that look
+     like a license or not (e.g. to know that a file has no license)
+
+   - Outputs: <filename>.goodsent, and <filename>.badsent (not used)
+
+4. Tokenizes sentences
+
+   - directory: senttok
+
+   - command: senttok.pl
+
+   - Purpose: It creates a file that corresponds to the recognized
+     sentence tokens. For each sentence, it outputs its sentence token, or unknown otherwise.
+
+   - Outputs: <filename>.senttok
+
+5. Matches sentences to licenses
+
+   - directory: matcher
+
+   - command: matcher.pl
+
+   - Purpose: looks at the sequence of sentence tokens and outputs the licenses found
+
+   - Outputs: <filename>.license
+
+The script ninka.pl runs all these steps, optionally removes
+intermediate files, and writes the licenses found to stdout.
+
+------
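
The pipe model described in this README can be summarised as a table of
stages and the file suffixes each one writes. The Python sketch below is
only an illustration and is not part of Ninka (which is written in Perl).
The names STAGES and intermediate_files are hypothetical; the stage
directories and suffixes are taken from the README text.

```python
# Hypothetical helper, not part of Ninka: it only restates the stage order
# and output suffixes listed in the README.
STAGES = [
    ("extComments", ["comments"]),         # 1. comment extractor
    ("splitter", ["sentences"]),           # 2. sentence splitter
    ("filter", ["goodsent", "badsent"]),   # 3. filter "good" sentences
    ("senttok", ["senttok"]),              # 4. sentence tokenizer
    ("matcher", ["license"]),              # 5. license matcher
]


def intermediate_files(source):
    """Return the file names a full run on `source` is described as creating."""
    return [
        "{}.{}".format(source, suffix)
        for _stage, suffixes in STAGES
        for suffix in suffixes
    ]


if __name__ == "__main__":
    for name in intermediate_files("foo.c"):
        print(name)
```

For foo.c this lists the same six files named in the Example section,
in pipeline order.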