author Douwe Maan <douwe@gitlab.com> 2016-11-18 17:15:56 +0000
committer Douwe Maan <douwe@gitlab.com> 2016-11-18 17:15:56 +0000
commit 2717675fbc10c97904e6a3eebf498d2c53fe5ce5
tree 098245239581426ea1a1a917eed62bdb784905ff /lib
parent 88479f7f3071c2d79447896deeea63fb78b175df
parent e1b868307169b562a595b5cb41bda7e8b984600f
Merge branch 'bugfix/html-only-mail' into 'master'
add parsing support for incoming html email

## What does this MR do?

Fixes #18388 by adding support for parsing HTML email

## Are there points in the code the reviewer needs to double check?

The new class, Gitlab::Email::HTMLParser, which needs to translate the HTML content to text and also delete replies, as they are not necessarily in the correct format to be caught by EmailReplyParser.

The solution I found that should work for any HTML-formatted email is to remove all `<table>` and `<blockquote>` tags. Actual `<table>` elements (to be interpreted by markdown) should already be encoded with e.g. `&lt;table&gt;` - the only failure mode is if there is an *actual* HTML table in the content itself, which we wouldn't be able to support easily anyways.

The gem `html2text` traverses the HTML tree and outputs text - and markdown in the case of HTML links or images.

See merge request !7397
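
As a rough illustration of the filtering idea described above, the sketch below drops every `<blockquote>` and `<table>` element from an HTML body. It is not the code in this commit: the commit removes nodes via Nokogiri XPath, while this version is a dependency-free regex approximation, and the helper name `strip_quoted_elements` is hypothetical. It also makes the stated failure mode visible, since a genuine table in the reply would be removed along with quoted material.

```ruby
# Dependency-free approximation of the reply filter described above.
# NOTE: the commit removes nodes with Nokogiri XPath; this regex version is
# only an illustrative stand-in, and `strip_quoted_elements` is a hypothetical name.
def strip_quoted_elements(html, tags = %w[blockquote table])
  result = html.dup
  tags.each do |tag|
    # Match an innermost <tag>...</tag> pair (no nested opening tag inside).
    innermost = %r{<#{tag}\b[^>]*>(?:(?!<#{tag}\b).)*?</#{tag}\s*>}mi
    # gsub! returns nil once nothing matches, which ends the loop;
    # repeating the pass peels nested quotes from the inside out.
    nil while result.gsub!(innermost, "")
  end
  result
end

reply = <<~HTML
  <p>Looks good to me.</p>
  <blockquote>On Friday, someone wrote:
    <blockquote>an older quoted reply</blockquote>
  </blockquote>
  <table><tr><td>signature or quoted thread</td></tr></table>
HTML

puts strip_quoted_elements(reply)
```

Entity-encoded markup such as `&lt;table&gt;` contains no literal tag, so it passes through untouched, which matches the reasoning in the description.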
Diffstat (limited to 'lib')
-rw-r--r-- lib/gitlab/email/html_parser.rb 34
-rw-r--r-- lib/gitlab/email/reply_parser.rb 19
2 files changed, 47 insertions, 6 deletions
diff --git a/lib/gitlab/email/html_parser.rb b/lib/gitlab/email/html_parser.rb
new file mode 100644
index 00000000000..a4ca62bfc41
--- /dev/null
+++ b/lib/gitlab/email/html_parser.rb
@@ -0,0 +1,34 @@
+module Gitlab
+  module Email
+    class HTMLParser
+      def self.parse_reply(raw_body)
+        new(raw_body).filtered_text
+      end
+
+      attr_reader :raw_body
+      def initialize(raw_body)
+        @raw_body = raw_body
+      end
+
+      def document
+        @document ||= Nokogiri::HTML.parse(raw_body)
+      end
+
+      def filter_replies!
+        document.xpath('//blockquote').each(&:remove)
+        document.xpath('//table').each(&:remove)
+      end
+
+      def filtered_html
+        @filtered_html ||= begin
+          filter_replies!
+          document.inner_html
+        end
+      end
+
+      def filtered_text
+        @filtered_text ||= Html2Text.convert(filtered_html)
+      end
+    end
+  end
+end
diff --git a/lib/gitlab/email/reply_parser.rb b/lib/gitlab/email/reply_parser.rb
index 3411eb1d9ce..85402c2a278 100644
--- a/lib/gitlab/email/reply_parser.rb
+++ b/lib/gitlab/email/reply_parser.rb
@@ -23,19 +23,26 @@ module Gitlab
       private
 
       def select_body(message)
-        text = message.text_part if message.multipart?
-        text ||= message if message.content_type !~ /text\/html/
+        if message.multipart?
+          part = message.text_part || message.html_part || message
+        else
+          part = message
+        end
 
-        return "" unless text
+        decoded = fix_charset(part)
 
-        text = fix_charset(text)
+        return "" unless decoded
 
         # Certain trigger phrases that means we didn't parse correctly
-        if text =~ /(Content\-Type\:|multipart\/alternative|text\/plain)/
+        if decoded =~ /(Content\-Type\:|multipart\/alternative|text\/plain)/
           return ""
         end
 
-        text
+        if (part.content_type || '').include? 'text/html'
+          HTMLParser.parse_reply(decoded)
+        else
+          decoded
+        end
       end
 
       # Force encoding to UTF-8 on a Mail::Message or Mail::Part
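
The part-selection change in `select_body` can be sketched in isolation as follows. This is a minimal illustration under stated assumptions: `FakePart`, `FakeMessage`, `pick_part`, and `html_part?` are hypothetical stand-ins written for this example, not classes or methods from the mail gem or from GitLab. It shows the fallback order introduced by the hunk (text part, then HTML part, then the message itself) and the content-type check that decides whether the body goes through `HTMLParser`.

```ruby
# Hypothetical stand-ins for Mail::Part / Mail::Message, for illustration only.
FakePart = Struct.new(:content_type)
FakeMessage = Struct.new(:content_type, :text_part, :html_part) do
  # Assumption for this sketch: a message is multipart if it carries any part.
  def multipart?
    !(text_part.nil? && html_part.nil?)
  end
end

# Mirrors the fallback order in the new select_body:
# prefer the text part, then the HTML part, then the message itself.
def pick_part(message)
  if message.multipart?
    message.text_part || message.html_part || message
  else
    message
  end
end

# Mirrors the content-type check; a nil content type is treated as non-HTML.
def html_part?(part)
  (part.content_type || "").include?("text/html")
end

html_only = FakeMessage.new("multipart/alternative", nil,
                            FakePart.new("text/html; charset=UTF-8"))
both      = FakeMessage.new("multipart/alternative",
                            FakePart.new("text/plain"), FakePart.new("text/html"))
plain     = FakeMessage.new("text/plain", nil, nil)

[html_only, both, plain].each do |message|
  part = pick_part(message)
  puts "#{part.content_type.inspect} needs HTML parsing: #{html_part?(part)}"
end
```

Before this change, an HTML-only message produced an empty body; with the fallback, only the HTML-only case is routed through the HTML parser, while messages that include a text part behave as before.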