Field Extraction Good Practice Guide
Posted: 18 Jan 2012 at 8:43pm
During my working week I get a lot of problems with Email2DB that end up being solved by following some good field extraction practices, so I have been motivated to write this FAQ thread to assist those in need.
Please feel free to post your own best practices so that we can build up a fairly comprehensive FAQ on this subject.
This guide is here to help you build field extractions that are more likely to keep working when you move from the Run Test button to actual emails.
Some background information: almost all emails contain two versions of the message, plain text and HTML. Most email clients show the HTML version, whereas Email2DB by default uses the plain text version, and with some emails the two versions are formatted differently. This means that copying and pasting what you see in Outlook will not give an accurate representation of the plain text version, and field extractions built that way can stop working as expected once you start processing real emails. The usual reason is that new lines do not appear at the same points in the plain text as they do in the HTML, or that the spacing between words or phrases is not the same.
This is the most common cause of problems when moving from Run Test to actual emails.
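To make the difference concrete, here is a minimal sketch in Python using only the standard library `email` package. It is not Email2DB code and says nothing about how Email2DB works internally; the sample message, its wording and the helper names (`build_sample_message`, `plain_text_body`, `html_body`) are all made up for illustration. It builds one email with both versions and reads each back, showing that the line breaks and spacing need not match.

```python
from email.message import EmailMessage


def build_sample_message() -> EmailMessage:
    """Build a hypothetical email that carries both a plain text and an HTML version."""
    msg = EmailMessage()
    msg["Subject"] = "Order confirmation"
    # Plain text version: the serial number sits on its own line
    # and there are extra spaces after "Customer:".
    msg.set_content("Serial No.\n123V8\nCustomer:   Jane Doe\n")
    # HTML version: what a mail client such as Outlook would normally display.
    msg.add_alternative(
        "<html><body><p>Serial No. <b>123V8</b></p>"
        "<p>Customer: Jane Doe</p></body></html>",
        subtype="html",
    )
    return msg


def plain_text_body(msg: EmailMessage) -> str:
    """Return the text/plain version of the message, or an empty string if there is none."""
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""


def html_body(msg: EmailMessage) -> str:
    """Return the text/html version of the message, or an empty string if there is none."""
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else ""


if __name__ == "__main__":
    message = build_sample_message()
    # repr() makes the new lines and spaces visible.
    print(repr(plain_text_body(message)))
    print(repr(html_body(message)))
```

In this sample, an extraction written against what the HTML shows ("Serial No." followed by a single space) would not line up with the plain text version, where a new line sits between the label and the value.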
Because of that, my number one best practice is not to build your field extractions around new lines or spaces unless you are absolutely sure that the message is plain text only and that its format is fixed. In those circumstances new lines in particular are extremely valuable.
My next best practice is to use regular expressions wherever possible. Provided the expression is built correctly for what you want, aesthetic formatting discrepancies in the email won't make a difference and the field will be extracted as expected, even if the email format is changed aesthetically in the future. You can also couple a "look for" phrase with a regular expression. For example, you can look for the phrase “Serial No.” and then look for [0-9][0-9][0-9][A-Z][0-9], which would extract a serial number made up of three digits followed by a letter and a final digit, e.g. 000A0, 999Z9 or 123V8. Coupling the two is useful when several strings in the email match the regular expression, because the phrase pins down which occurrence you want.
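The same idea can be sketched in Python with the standard `re` module. This is an illustration only: Email2DB's regular expression dialect and its "look for" option may behave differently, and the names `SERIAL_PATTERN` and `extract_serial` are hypothetical. The pattern joins the phrase and the expression from the example above with `\s*`, so that any mix of spaces and new lines between them is tolerated.

```python
import re
from typing import Optional

# "Serial No." followed by any amount of whitespace (spaces, tabs or new lines),
# then three digits, one capital letter and one digit, captured as group 1.
SERIAL_PATTERN = re.compile(r"Serial No\.\s*([0-9][0-9][0-9][A-Z][0-9])")


def extract_serial(text: str) -> Optional[str]:
    """Return the serial number that follows "Serial No.", or None if there is no match."""
    match = SERIAL_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "Serial No. 123V8",                  # single space, as the HTML might show it
        "Serial No.\n000A0",                 # new line, as the plain text might have it
        "Ref 555B5\nSerial No.   999Z9",     # another matching string appears earlier
        "No serial number in this message",  # nothing to extract
    ]
    for sample in samples:
        print(repr(sample), "->", extract_serial(sample))
```

Because the phrase is part of the pattern, the earlier "555B5" in the third sample is skipped even though it matches the digit and letter expression on its own.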
At the end of the day, the best way to extract a specific field depends entirely on the environment; there is no one-size-fits-all solution to field extractions. As always, if the formatting of the messages changes, or if information is removed or added, you will need to go back to the drawing board, then rebuild and retest those extractions.
This post is here to help people find the method that works best for them and to introduce ways of building extractions that you may not have thought about before.
Edited by Liam - 19 Jan 2012 at 12:57pm