
Field Extraction Good Practice Guide

Author: Liam (Admin Group)

Joined: 29 Jun 2011
Location: Stoke-on-Trent
Posts: 238
    Posted: 18 Jan 2012 at 8:43pm

Field Extraction good practice guide

 

During my working week I get a lot of problems with Email2DB that end up being solved by following some good field extraction practices. That has motivated me to write this FAQ thread to help those in need.

 

Please feel free to post your own best practices, so that we can build up a fairly comprehensive FAQ just for this subject.

 

This guide is here to help you build field extractions that are more likely to work when moving from the Run Test button to actual emails.

 

Some background information: Almost all emails have two formats, plain text and HTML. Most email clients show the HTML version, whereas Email2DB uses the plain text by default. With some emails the plain text and HTML versions are formatted differently. This means that copying and pasting what you see in Outlook will not give an accurate representation of the plain text version of the email, and field extractions built from it can fail when you start processing real emails. The most common reasons are that new lines do not appear in the plain text at the same points as in the HTML, or that the spacing between words or phrases is not the same.
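
If you want to see the difference for yourself outside Email2DB, here is a rough sketch in Python. It is not part of Email2DB and says nothing about how Email2DB works internally; the sample message and the names build_sample and get_bodies are made up purely for illustration. It builds an email with both formats and pulls out each version so you can compare them.

```python
from email.message import EmailMessage


def build_sample():
    """Build a made-up email that carries both a plain text and an HTML body."""
    msg = EmailMessage()
    msg["Subject"] = "Order confirmation"
    # Plain text version: the value sits on its own line.
    msg.set_content("Serial No.\n123V8\n")
    # HTML version: the same value sits on the same line as the label.
    msg.add_alternative("<p>Serial No. <b>123V8</b></p>", subtype="html")
    return msg


def get_bodies(msg):
    """Return (plain_text, html) for a message; either may be None."""
    plain = msg.get_body(preferencelist=("plain",))
    html = msg.get_body(preferencelist=("html",))
    return (
        plain.get_content() if plain is not None else None,
        html.get_content() if html is not None else None,
    )


plain_text, html_text = get_bodies(build_sample())
print(repr(plain_text))
print(repr(html_text))
```

In this made-up message the plain text has a new line between the label and the value, while the HTML does not, which is exactly the sort of mismatch described above.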

 

This is the most common cause of problems when moving from Run Test to actual emails.

 

Because of that, my number one best practice is not to build your field extractions around new lines or spaces unless you are absolutely sure that the message is plain text only and its format is fixed. In those circumstances line feeds in particular are extremely valuable.
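
To illustrate why, here is a small Python sketch (again, not Email2DB itself; the sample text and the helper names extract_between and collapse_spaces are assumptions of mine). It takes the same field from text as it might look when pasted from a mail client and from a plain text version that wraps differently, first stopping at the new line and then stopping at the next label instead.

```python
def extract_between(text, start_label, stop):
    """Return the text found between start_label and stop, or None."""
    start = text.find(start_label)
    if start == -1:
        return None
    start += len(start_label)
    end = text.find(stop, start)
    if end == -1:
        return None
    return text[start:end]


def collapse_spaces(value):
    """Turn any run of spaces, tabs or new lines into a single space."""
    return " ".join(value.split())


# What you might paste from the mail client (hypothetical sample).
pasted = "Name: John Smith\nPhone: 555"
# The same content as a plain text version might wrap it (hypothetical sample).
plain = "Name: John\nSmith Phone: 555"

# Stopping at the new line works on the pasted text but not on the real one.
print(extract_between(pasted, "Name: ", "\n"))   # John Smith
print(extract_between(plain, "Name: ", "\n"))    # John

# Stopping at the next label works on both once the spacing is tidied up.
print(collapse_spaces(extract_between(pasted, "Name: ", "Phone:")))  # John Smith
print(collapse_spaces(extract_between(plain, "Name: ", "Phone:")))   # John Smith
```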

 

My next best practice is to use regular expressions wherever possible. Provided the expression is built correctly for what you want, aesthetic formatting discrepancies in the email won't make a difference and the field will be extracted as expected, even if the email's appearance changes in the future. You can also couple a look for and a regular expression together. For example, you can look for the phrase “Serial No.” and then look for [0-9][0-9][0-9][A-Z][0-9]. This would extract a serial number made up of three digits followed by an uppercase letter and a final digit, e.g. 000A0, 999Z9 or 123V8. Coupling the two is useful when several pieces of text in the email match the regular expression.
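
The same idea can be sketched in Python to show what coupling a look for with a regular expression achieves. This is only an illustration under my own assumptions: the function name extract_serial and the sample text are made up, and Email2DB's own matching may behave differently in the details.

```python
import re

# Three digits, one uppercase letter, one digit (the pattern from the post).
SERIAL_PATTERN = re.compile(r"[0-9][0-9][0-9][A-Z][0-9]")


def extract_serial(text, look_for="Serial No."):
    """Find look_for first, then return the first pattern match after it."""
    pos = text.find(look_for)
    if pos == -1:
        return None
    # Start the regular expression search just past the look-for phrase.
    match = SERIAL_PATTERN.search(text, pos + len(look_for))
    return match.group(0) if match else None


# Hypothetical email text with two values that both fit the pattern.
sample = "Order ref 555B5\nSerial No.\n   123V8\n"

print(SERIAL_PATTERN.search(sample).group(0))  # 555B5 (pattern on its own)
print(extract_serial(sample))                  # 123V8 (look for + pattern)
```

Note that the new line and extra spaces between the phrase and the serial number make no difference to the result, because the expression describes the value itself and not where it sits.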


At the end of the day, the best way to extract a specific field depends entirely on the environment; there is no one-size-fits-all solution to field extractions. As always, if the formatting of the messages changes, or if information is removed or added, you will need to go back to the drawing board, then retest and rebuild those extractions.


This post is here to help you find the best method for your situation, or to introduce ways of building extractions that you may not have thought about before.


Enjoy!





Edited by Liam - 19 Jan 2012 at 12:57pm