I'm looking for an Excel macro expert to transform a dictionary (Polish-English) into a termbase (an input file to be imported into Memsource CAT software). The input file is attached for your perusal. It contains 61031 entries, a sample input entry, the steps I took to create the final termbase for that entry, and the final termbase for that entry. The output file needs to comply with the following criteria:
1. Terms organized into columns where each column represents a language
2. Make sure there is the appropriate language code in the header of each column (in our case pl, en)
Example of dictionary entry:
analiza f 1. analysis 2. chem. analysis 3. mat. analysis; calculus
~ absorpcyjna absorption analysis
first word "analiza" is the Polish term
letters "f", "m", "a" mean feminine, masculine, neuter
1. 2. 3. numbers denote different versions of translation depending on context
"chem." is an abbreviation of "chemical" and indicates the subject matter area
"analysis" is the English term
"~" means that the following term is a child term and is created by joining parent term "analiza" and child term "absorpcyjna" to create the term "analiza absorpcyjna"
"absorption analysis" is the English child term
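The entry format described above can be sketched in a few lines. This is only an illustration in Python of the logic a macro would need, not a finished solution. The function names `parse_head_line` and `expand_child` are hypothetical, the gender markers are taken as listed in this posting, and nested child lines starting with "~ ~" are deliberately not handled here.

```python
import re

# Gender markers exactly as listed in the posting (assumption: no others occur).
GENDER_MARKERS = {"f", "m", "a"}


def parse_head_line(line):
    """Split a head line into (polish_term, [raw sense strings]).

    Example input: 'analiza f 1. analysis 2. chem. analysis 3. mat. analysis; calculus'
    """
    tokens = line.split()
    # Everything before the first gender marker is treated as the Polish headword.
    for i, tok in enumerate(tokens):
        if i > 0 and tok in GENDER_MARKERS:
            polish = " ".join(tokens[:i])
            rest = " ".join(tokens[i + 1:])
            break
    else:
        raise ValueError("no gender marker found in: " + line)
    # Senses are introduced by '1.', '2.', ...; a single-sense entry has no numbers.
    senses = [s.strip() for s in re.split(r"(?:^|\s)\d+\.\s+", rest) if s.strip()]
    return polish, senses


def expand_child(parent, line):
    """Replace a single leading '~' with the parent term.

    The result still mixes Polish and English; finding the boundary
    between them is a separate step.
    """
    if not line.startswith("~"):
        raise ValueError("not a child line: " + line)
    return parent + " " + line[1:].strip()


polish, senses = parse_head_line(
    "analiza f 1. analysis 2. chem. analysis 3. mat. analysis; calculus"
)
child = expand_child(polish, "~ absorpcyjna absorption analysis")
```

With the sample entry, `polish` is "analiza", `senses` holds the three numbered translations with their subject labels still attached, and `child` is "analiza absorpcyjna absorption analysis".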
OUTPUT FILE STRUCTURE
The basic structure of the output file needs to be this:
column A: Polish term
column B: English term
column C-X: second and next (if available) English term
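To make the column layout concrete, here is a minimal Python sketch that pads every row to the same width and adds a language-code header. It assumes that every additional English column simply repeats the "en" code, which should be verified against what Memsource actually accepts on import. The helper names `build_rows` and `to_csv` are hypothetical.

```python
import csv
import io


def build_rows(termbase):
    """termbase: list of (polish_term, [english_terms]).

    Returns a rectangular list of rows: header first, then one row per
    Polish term, padded with empty cells up to the widest entry.
    """
    width = max((len(en) for _, en in termbase), default=1)
    width = max(width, 1)
    # Assumption: each extra English column carries the same 'en' code.
    rows = [["pl"] + ["en"] * width]
    for pl, en in termbase:
        rows.append([pl] + list(en) + [""] * (width - len(en)))
    return rows


def to_csv(rows):
    """Serialize rows as CSV text (column A = Polish, B onward = English)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = build_rows([
    ("analiza", ["analysis", "calculus"]),
    ("analiza absorpcyjna", ["absorption analysis"]),
])
csv_text = to_csv(rows)
```

The resulting text can be opened in Excel or saved as a file for import; writing a native .xlsx file would need an additional library and is left out of this sketch.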
QUALITY ASSURANCE (QA)
1. Check for blank rows (present in source file) and remove them
2. The convention "zob." means "refer to" and points to a Polish term with the same meaning. Use the macro to look up that synonymous term and insert its English equivalent
3. Check that the number of Polish & English terms is equal
4. Subsequent English terms are separated by a comma "," or a semicolon ";"
5. Bracketed sentences are explanations - they can be ignored and don't need to be included on the output files
6. Abbreviations like "mat." or "chem." denote subject matter areas and are redundant - should be ignored and excluded from output file. The English term is located directly after those abbreviations. How to recognise those abbreviations? They are usually incomplete words with a "." (dot) directly after last letter in the word.
7. Remove entries shorter than 3 letters (1 and 2 letters long)
8. There are some cells where there's no distinguishable marker between terms, e.g.
"~ ~ wielu zmiennych multivariate analysis of covariance"
In that instance, the macro should also detect the language to determine where the Polish term ends and where the English one begins
9. Spot-check your work - I will spot-check 900 random entries to ensure proper quality and structure
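Several of the QA rules above (2, 4, 5, 6 and 7) can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a tested solution for the full file: subject labels are assumed to be lowercase words of up to six letters followed by a dot, the 3-letter minimum is applied to each English term, and the sample data, including the "zob." entries, is made up. The names `clean_sense`, `split_terms` and `resolve_references` are hypothetical. Language detection for rule 8 is not covered.

```python
import re

# Rule 5: bracketed explanations, round or square brackets.
BRACKETS = re.compile(r"\([^)]*\)|\[[^\]]*\]")
# Rule 6 (heuristic): a short lowercase word directly followed by a dot.
SUBJECT_ABBREV = re.compile(r"\b[a-ząćęłńóśźż]{1,6}\.(?=\s|$)")


def clean_sense(sense):
    """Drop bracketed explanations and subject labels, normalize spaces."""
    sense = BRACKETS.sub(" ", sense)
    sense = SUBJECT_ABBREV.sub(" ", sense)
    return re.sub(r"\s+", " ", sense).strip()


def split_terms(sense):
    """Rule 4: split on ',' or ';'. Rule 7: drop terms shorter than 3 letters."""
    terms = [t.strip() for t in re.split(r"[,;]", clean_sense(sense))]
    return [t for t in terms if len(t) >= 3]


def resolve_references(entries):
    """Rule 2: follow 'zob. <polish term>' to the referenced entry.

    entries maps a Polish term to its raw English text or to a 'zob.' reference.
    Unknown targets and reference loops yield an empty list for manual review.
    """
    resolved = {}
    for pl, text in entries.items():
        seen = {pl}
        while text.startswith("zob."):
            target = text[len("zob."):].strip()
            if target in seen or target not in entries:
                text = ""
                break
            seen.add(target)
            text = entries[target]
        resolved[pl] = split_terms(text)
    return resolved


# Made-up sample data for illustration only.
sample = {
    "analiza": "mat. analysis; calculus (rachunek)",
    "rozbiór": "zob. analiza",
    "x": "zob. brak",
}
result = resolve_references(sample)
```

Brackets are removed before splitting so that a comma inside an explanation does not create a spurious term. Note that the abbreviation heuristic would also strip "zob." itself, which is why references are resolved before cleaning.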
1. Self-reliant, self-starter
2. Highly experienced in writing Excel macros and/or data analytics and/or data science
3. Great mathematical problem solver capable of building complex algorithms
For 100% payment I require that the milestones below are adhered to and that the working files are shared with me for acceptance accordingly.
I. Polish terms column is created and passes 300 entry spot check by me
II. English terms columns are created and pass 300 entry spot check by me
III. Final termbase is created, has correct structure, an equal number of Polish and English terms and passes 300 entry spot check by me AND successfully loads into Memsource CAT software.