Diffstat (limited to 'README')
-rw-r--r-- | README | 124 |
1 files changed, 47 insertions, 77 deletions
@@ -11,16 +11,13 @@
 under which a source file is made available. This tool uses a source
 file as input and outputs the licenses identified within that file.

-If you need to know the detail of Ninka, please see the following
-paper:
+If you need to know the detail of Ninka, please see the following paper:

 Daniel M. German, Yuki Manabe and Katsuro Inoue. A sentence-matching
 method for automatic license identification of source code files. In
 25nd IEEE/ACM International Conference on Automated Software
 Engineering (ASE 2010). You can email me (dmg@uvic.ca) for a copy or
-download it from
-
-http://turingmachine.org/~dmg/papers/dmg2010ninka.pdf
+download it from http://turingmachine.org/~dmg/papers/dmg2010ninka.pdf.

 If you use Ninka for research purposes, we would appreciate you cite
 the above paper.
@@ -28,13 +25,13 @@ the above paper.
 * Contributors

 - Paul Clough for his code to split sentences
-- Anthony Kohan for writing the excel and sqlite backends.
-- Armijn Hemel from Tjaldur Software Governance Solutions for multiple bug reports and suggestions
+- Anthony Kohan for writing the excel and sqlite backends
+- Armijn Hemel from Tjaldur Software Governance Solutions for multiple bug reports and suggestions
+- René Scheibe for modularizing the code

 * License

-  Except for the directories comments and splitter, Ninka is licensed
-  under the GPLv2+
+  Ninka is licensed under the GPLv2+:

   Copyright (C) 2009-2014 Yuki Manabe and Daniel M. German

@@ -51,59 +48,41 @@ the above paper.
   You should have received a copy of the GNU General Public License
   along with this program. If not, see <http://www.gnu.org/licenses/>.

-
-  splitter.pl is a derivative work of the Rule-based sentence
-  splitter script by Paul Paul Clough. Please see splitter/README
-  for details.
+  Ninka::SentenceExtraxtor is a derivative work of the rule-based sentence
+  splitter script by Paul Paul Clough.

-
-  comments is based on a program to remove comments by Jon Newman,
-  it is released under the GNU General Public License Version 2 or
-  (at your option) any later version.
+  comments is based on a program to remove comments by Jon Newman.

 * Requirements

 - Perl version 5 or above
-- for ninka-excel.pl: Perl module Spreadsheet::WriteExcel
-  https://metacpan.org/release/Spreadsheet-WriteExcel/
-- for ninka-sqlite.pl: Perl module DBD::SQLite
+- for ninka-excel: Perl module Spreadsheet::WriteExcel
+  https://metacpan.org/release/Spreadsheet-WriteExcel
+- for ninka-sqlite: Perl module DBD::SQLite
   https://metacpan.org/release/DBD-SQLite

 * How to install

 1. Unpack the distribution in a directory.
- 2. Optional: Build and install comments (make sure it is somwehere in the
-    path) (see directory comments)
-
+ 2. Optional: Build and install comments (make sure it is somwehere in the path) (see directory comments)

-* Usage:
+* Usage

-Ninka uses a pipe model (see below). Each step of the "pipe" creates a
-file, but
+ninka [options] filename

-ninka.pl [options] [filename]
+Available options:

-Available options
+  -i create intermediary files
   -v verbose
-  -d delete intermediate files
-  -C force creation of comments file
-  -c stop after creation of comments
-  -S force creation of sentences file
-  -s stop after creation of sentences
-  -G force creation of goodsent file
-  -g stop after creation of goodsent
-  -T force creation of senttok file
-  -t stop after creation of senttok
-  -L force creation of license file
-  -f force all processing
-
 Example:

-  ninka.pl foo.c
+  ninka -i foo.c

 It will create five files:

-  1. foo.c.comments: extracted the first two comments blocks, where
-     the license is usually
+  1. foo.c.comments: extracted the first comments blocks, where
+     the license is usually included
   2. foo.c.sentences: creates the list of sentences in the license
      statement
   3. foo.c.goodsent: contains sentences that are likely to be part of
@@ -117,69 +96,60 @@ It will create five files:
   - Licenses
   - Unmatched sentences in *.senttok that were not matched

-
-
+The files are not required for Ninka's functionality. But they can help
+to debug license detection issues.

 * Ninka model

 Ninka uses a pipe-model. Each stage of the pipe does something very
 specific:

- 1. Comment extractor.
+1. Comment extractor

-  - directory: extComments
+  - Module: Ninka::CommentExtractor

-  - command: extComments.pl, might use comments (included in distribution)
+  - Purpose: Extracts top comments of source code.
+    If no comment extractor is known for the language,
+    then extracts top lines from source (currently 700)

-  - Purpose: Extracts top comments of source code. If no
-    comment extractor is known for the language, then extracts top lines from source (currently 700)
-
-  - Creates <filename>.comments file
+  - Output: <filename>.comments

 2. Split sentences in comments

-  - directory: splitter
-
-  - command: splitter.pl
-
-  - Purpose: Ninka works by matching sentences of licenses, hence
-    it needs to properly break text into sentences.
-
-  - Outputs <filename>.sentences
-
-3. Filter "good" sentences.
+  - Module: Ninka::SentenceExtractor

-  - directory filter
+  - Purpose: Ninka works by matching sentences of licenses,
+    hence it needs to properly break text into sentences.

-  - command: filter.pl
+  - Output: <filename>.sentences

-  - Purpose: some sentences are related to a license, some are
-    not. It is valuable to know if a file contains lines that look
-    like a license or not (e.g. to know that a file has no license)
+3. Filter "good" sentences

-  - Outputs: <filename>.goodsent, and <filename>.badsent (not used)
+  - Module: Ninka::SentenceFilter

-4. Tokenizes sentences
+  - Purpose: Some sentences are related to a license, some are not.
+    It is valuable to know if a file contains lines that look like
+    a license or not (e.g. to know that a file has no license).

-  - Directory senttok
+  - Output: <filename>.goodsent and <filename>.badsent

-  - command: senttok.pl
+4. Tokenize sentences

-  - Purpose: It creates a file that corresponds to the recognized
-    sentence tokens. For each sentence, it outputs its sentence token, or unknown otherwise.
+  - Module: Ninka::SentenceTokenizer

-  - Outputs: <filename>.senttok
+  - Purpose: It creates a file that corresponds to the recognized sentence tokens.
+    For each sentence, it outputs its sentence token, or unknown otherwise.

-5. Matches sentences to licenses
+  - Output: <filename>.senttok

-  - Directory matcher
+5. Match sentences to licenses

-  - Command: matcher.pl
+  - Module: Ninka::LicenseMatcher

-  - Purpose: looks at the sequence of sentence tokens and outputs the licenses found
+  - Purpose: It looks at the sentence tokens and outputs the licenses found.

   - Output: <filename>.license

-The script ninka.pl takes care of all these steps, and optionally removes
+The script ninka takes care of all these steps, and optionally creates
 intermediary files, and writes to the stdout the licenses found.

 ------