author | Douwe Maan <douwe@gitlab.com> | 2016-11-18 17:15:56 +0000 |
---|---|---|
committer | Douwe Maan <douwe@gitlab.com> | 2016-11-18 17:15:56 +0000 |
commit | 2717675fbc10c97904e6a3eebf498d2c53fe5ce5 (patch) | |
tree | 098245239581426ea1a1a917eed62bdb784905ff /lib | |
parent | 88479f7f3071c2d79447896deeea63fb78b175df (diff) | |
parent | e1b868307169b562a595b5cb41bda7e8b984600f (diff) | |
download | gitlab-ce-2717675fbc10c97904e6a3eebf498d2c53fe5ce5.tar.gz |
Merge branch 'bugfix/html-only-mail' into 'master'
add parsing support for incoming html email
## What does this MR do?
Fixes #18388 by adding support for parsing HTML email
## Are there points in the code the reviewer needs to double check?
The new class, `Gitlab::Email::HTMLParser`, needs to translate the HTML content to text and also delete quoted replies, as they are not necessarily in the correct format to be caught by EmailReplyParser. The solution I found that should work for any HTML-formatted email is to remove all `<table>` and `<blockquote>` elements. Actual `<table>` tags typed by the user (to be interpreted by markdown) should already be encoded as e.g. `&lt;table&gt;`. The only failure mode is an *actual* HTML table in the content itself, which we wouldn't be able to support easily anyway.
The gem `html2text` traverses the HTML tree and outputs text - and markdown in the case of HTML links or images.
See merge request !7397
Diffstat (limited to 'lib')
-rw-r--r-- | lib/gitlab/email/html_parser.rb | 34 |
-rw-r--r-- | lib/gitlab/email/reply_parser.rb | 19 |
2 files changed, 47 insertions, 6 deletions
diff --git a/lib/gitlab/email/html_parser.rb b/lib/gitlab/email/html_parser.rb
new file mode 100644
index 00000000000..a4ca62bfc41
--- /dev/null
+++ b/lib/gitlab/email/html_parser.rb
@@ -0,0 +1,34 @@
+module Gitlab
+  module Email
+    class HTMLParser
+      def self.parse_reply(raw_body)
+        new(raw_body).filtered_text
+      end
+
+      attr_reader :raw_body
+      def initialize(raw_body)
+        @raw_body = raw_body
+      end
+
+      def document
+        @document ||= Nokogiri::HTML.parse(raw_body)
+      end
+
+      def filter_replies!
+        document.xpath('//blockquote').each(&:remove)
+        document.xpath('//table').each(&:remove)
+      end
+
+      def filtered_html
+        @filtered_html ||= begin
+          filter_replies!
+          document.inner_html
+        end
+      end
+
+      def filtered_text
+        @filtered_text ||= Html2Text.convert(filtered_html)
+      end
+    end
+  end
+end
diff --git a/lib/gitlab/email/reply_parser.rb b/lib/gitlab/email/reply_parser.rb
index 3411eb1d9ce..85402c2a278 100644
--- a/lib/gitlab/email/reply_parser.rb
+++ b/lib/gitlab/email/reply_parser.rb
@@ -23,19 +23,26 @@ module Gitlab
       private
 
       def select_body(message)
-        text = message.text_part if message.multipart?
-        text ||= message if message.content_type !~ /text\/html/
+        if message.multipart?
+          part = message.text_part || message.html_part || message
+        else
+          part = message
+        end
 
-        return "" unless text
+        decoded = fix_charset(part)
 
-        text = fix_charset(text)
+        return "" unless decoded
 
         # Certain trigger phrases that means we didn't parse correctly
-        if text =~ /(Content\-Type\:|multipart\/alternative|text\/plain)/
+        if decoded =~ /(Content\-Type\:|multipart\/alternative|text\/plain)/
           return ""
         end
 
-        text
+        if (part.content_type || '').include? 'text/html'
+          HTMLParser.parse_reply(decoded)
+        else
+          decoded
+        end
       end
 
       # Force encoding to UTF-8 on a Mail::Message or Mail::Part
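The part-selection order introduced in `select_body` can be exercised in isolation with a small sketch. Everything here is hypothetical scaffolding: `FakePart` is a stand-in for `Mail::Message` / `Mail::Part` that models only the methods the selection touches, and `choose_part` and `html_part?` are illustrative names, not methods from the merge request.

```ruby
# Hypothetical stand-in for Mail::Message / Mail::Part.
FakePart = Struct.new(:content_type, :body, :text_part, :html_part, keyword_init: true) do
  # Assumption for this sketch: a message is multipart if it carries sub-parts.
  def multipart?
    !(text_part.nil? && html_part.nil?)
  end
end

# Prefer the plain-text part, then the HTML part, then the message itself.
def choose_part(message)
  return message unless message.multipart?

  message.text_part || message.html_part || message
end

# A missing content type is treated as "not HTML", mirroring the nil guard
# in the diff.
def html_part?(part)
  (part.content_type || '').include?('text/html')
end

plain = FakePart.new(content_type: 'text/plain', body: 'hi')
html  = FakePart.new(content_type: 'text/html; charset=UTF-8', body: '<p>hi</p>')

both      = FakePart.new(content_type: 'multipart/alternative', text_part: plain, html_part: html)
html_only = FakePart.new(content_type: 'multipart/alternative', html_part: html)
bare_html = FakePart.new(content_type: 'text/html', body: '<p>hi</p>')
no_type   = FakePart.new(body: 'hi')

[both, html_only, bare_html, no_type].each do |message|
  part = choose_part(message)
  puts "#{part.content_type.inspect} -> html? #{html_part?(part)}"
end
```

In this model only the HTML-only multipart message and the bare HTML message would be routed through the HTML parser; a message that also has a text part keeps using the plain-text path.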