Converting Bible translation from MS Word/LibreOffice

Posted May 2, 2019

Last Updated On May 2, 2019

By Mondele

The files referred to in this article can be downloaded here: Tarangan_James_Philemon

While we normally encourage translation work to be done in one of our tools (Autographa, translationStudio, vMAST), sometimes the work is already in progress, or it is simply better for the local team to do it in another program.

We received work that had been done in Microsoft Word. It had been formatted so that the verse numbers were superscripted (like ¹ this). Consistent formatting like this makes the document regular, and therefore easier to convert.

I don’t have MS Word, so I opened the document in LibreOffice. The first thing I did was go to File… and choose Export… The format I chose was XHTML (.html;.xhtml). This made a copy of the file with an extension of .html.

Now I opened the html file in a text editor. I used Bluefish, a cross-platform free open-source HTML editor.

At the top of the file was a lot of formatting information:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!--This file was converted to xhtml by LibreOffice - see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/xslt for the code.-->
<head profile="http://dublincore.org/documents/dcmi-terms/">
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
<title xml:lang="en-US">
- no title specified</title>
<meta name="DCTERMS.title" content="" xml:lang="en-US"/>
<meta name="DCTERMS.language" content="en-US" scheme="DCTERMS.RFC4646"/>
<meta name="DCTERMS.source" content="http://xml.openoffice.org/odf2xhtml"/>
<meta name="DCTERMS.creator" content="lifestyle"/>
<meta name="DCTERMS.issued" content="2019-02-21T11:42:00" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.contributor" content="user"/>
<meta name="DCTERMS.modified" content="2019-02-21T11:45:00" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.provenance" content="" xml:lang="en-US"/>
<meta name="DCTERMS.subject" content="," xml:lang="en-US"/>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" hreflang="en"/>
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" hreflang="en"/>
<link rel="schema.DCTYPE" href="http://purl.org/dc/dcmitype/" hreflang="en"/>
<link rel="schema.DCAM" href="http://purl.org/dc/dcam/" hreflang="en"/>
<style type="text/css">

@page { }
table { border-collapse:collapse; border-spacing:0; empty-cells:show }
td, th { vertical-align:top; font-size:12pt;}
h1, h2, h3, h4, h5, h6 { clear:both;}
ol, ul { margin:0; padding:0;}
li { list-style: none; margin:0; padding:0;}
/* "li span.odfLiEnd" - IE 7 issue*/
li span. { clear: both; line-height:0; width:0; height:0; margin:0; padding:0; }
span.footnodeNumber { padding-right:1em; }
span.annotation_style_by_filter { font-size:95%; font-family:Arial; background-color:#fff000; margin:0; border:0; padding:0; }
span.heading_numbering { margin-right: 0.8rem; }* { margin:0;}
.gr1 { border-width:0.0133cm; border-style:solid; border-color:#000000; font-size:11pt; margin-bottom:0.0146in; margin-left:0.1252in; margin-right:0.1402in; margin-top:0in; padding:0.0591in; font-family:Calibri; vertical-align:top; min-height:0in;min-width:0in;padding-top:0.05in; padding-bottom:0.05in; padding-left:0.1in; padding-right:0.1in; }
.gr2 { border-width:0.0133cm; border-style:solid; border-color:#000000; font-size:11pt; margin-bottom:0.0028in; margin-left:0.1252in; margin-right:0.1252in; margin-top:0in; padding:0.0591in; font-family:Calibri; vertical-align:top; min-height:0.2728in;min-width:0.1409in;padding-top:0.05in; padding-bottom:0.05in; padding-left:0.1in; padding-right:0.1in; }
.Footer { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.Header { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.P1 { color:#00000a; font-size:11pt; text-align:right ! important; font-family:Calibri; writing-mode:lr-tb; line-height:200%; }
.P10 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; }
.P11 { font-size:18pt; font-family:Calibri; writing-mode:page; text-align:left ! important; }
.P12 { font-size:18pt; font-family:Calibri; writing-mode:page; text-align:left ! important; }
.P2 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.P3 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.P4 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.P5 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; font-weight:bold; }
.P6 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; font-weight:bold; }
.P7 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.Standard { font-size:11pt; font-family:Calibri; writing-mode:lr-tb; text-align:left ! important; color:#00000a; }
.T1 { font-style:italic; }
.T2 { font-style:italic; }
.T3 { font-size:16pt; font-weight:bold; }
.T5 { font-weight:bold; }
.T6 { font-size:18pt; font-weight:bold; }
.T7 { vertical-align:super; font-size:58%;}
.T8 { vertical-align:super; font-size:58%;font-weight:bold; }
/* ODF styles with no properties representable as CSS */
.Sect1 .T4 { }
</style>

The only parts of this that matter to me are .T7 and .T8, both of which set vertical-align:super. Text in these classes will be superscripted, so it is probably a verse number or a footnote marker.
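If the style block is long, the superscript classes can also be picked out with a short script. The following is only a sketch: the function name superscript_classes and the sample_css string are made up for illustration, and the pattern assumes simple one-class rules like the ones LibreOffice wrote above.

```python
import re

def superscript_classes(css: str) -> list[str]:
    """Return the class names whose rule body sets vertical-align:super."""
    names = []
    # Each simple rule looks like: .T7 { vertical-align:super; font-size:58%;}
    for name, body in re.findall(r"\.([A-Za-z_][\w-]*)\s*\{([^}]*)\}", css):
        if re.search(r"vertical-align\s*:\s*super", body):
            names.append(name)
    return names

# A few rules copied from the exported file's <style> block.
sample_css = """
.T5 { font-weight:bold; }
.T6 { font-size:18pt; font-weight:bold; }
.T7 { vertical-align:super; font-size:58%;}
.T8 { vertical-align:super; font-size:58%;font-weight:bold; }
"""

print(superscript_classes(sample_css))  # ['T7', 'T8']
```

The class names change from one export to the next, so it is worth checking them for each document rather than assuming T7.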

Sure enough, when you look down (below the <body… tag) you can see <span class="T7">3</span>. (Verse 1 had no number of its own, only the chapter number, and the number for verse 2 had not been superscripted, so it sits in the body as plain text.)

We’re ready to start cleaning up the text.

First, we remove everything before the beginning of the text, which in this case is Salam Aban yudas. Then we put in \c 1 for the first chapter and, on a new line, \v 1 for the first verse, followed by a space and then the text of verse 1. I found the number 2 for verse 2 and did the same thing.

Now, most of the rest can be done automatically: I go to the Edit menu and choose Advanced Find and Replace (other programs may call this something different; for example, in Notepad++ for Windows, it’s under the Search menu, Replace…). Using a Regular expression so that we can clean it in one step, we search for

<span class="T7">(\d+)</span>

and we replace it with

\n\\v \1

Let’s explain this piece by piece. <span class="T7"> should make sense: that’s what we saw in the formatting information at the beginning. Everything up to the next </span> tag will be superscripted, and should be a verse number.

The parentheses “capture” whatever matches inside them, so that we don’t lose the verse number. \d is regular-expression shorthand for “a digit”, any character from 0 to 9. The + that follows means “one or more of what comes before it”, so \d+ matches one or more digits. In some programs we also need to add ?, meaning “don’t take more than you need”, which makes it \d+? inside the parentheses.
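To see what the capture group and the optional ? actually do, here is a small sketch using Python's re module. This is an assumption for illustration only: the article's editor has its own regex engine, though the syntax shown here is the same. The sample strings are invented.

```python
import re

# On its own, \d+ is greedy and takes every digit it can,
# while \d+? is lazy and stops after the first one.
greedy = re.search(r"(\d+)", "v13 rest").group(1)
lazy = re.search(r"(\d+?)", "v13 rest").group(1)
print(greedy)  # 13
print(lazy)    # 1

# Inside the full pattern, the closing </span> forces even the lazy
# form to take the whole number, so both spellings capture the same text.
html = '<span class="T7">13</span>'
anchored = re.search(r'<span class="T7">(\d+?)</span>', html).group(1)
print(anchored)  # 13
```

Because the verse number is always followed directly by the closing tag, either form is safe in this search.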

For the replacement, \n means “start a new line”. In USFM every verse needs to be on its own line. Next comes \\v, because we want the output to contain \v. In regular expressions the backslash \ is a special character (remember \d?), so to get a literal \ we have to double it. \1 means “insert whatever the first pair of parentheses captured”. In other words, \1 puts our verse number back. For the first superscripted verse in this file that is 3, but the same replacement applies to every match.

The final thing to notice is that there’s a space after the \1 in the replacement phrase. It’s important to have a space between the verse number and the verse text.

So, for the first verse with a superscripted number in this file, we have <span class="T7">3</span>gwel jak ago being turned into

\v 3 gwel jak ago

(See how it’s on its own line?)
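The same search and replace can be sketched in Python for anyone who prefers a script to an editor dialog. This is a minimal illustration, not the method used for this article: the function name spans_to_usfm is made up, and it assumes the verse numbers use class T7 as they do in this export.

```python
import re

# The search pattern from the article: a T7 span containing only digits.
VERSE_SPAN = re.compile(r'<span class="T7">(\d+)</span>')

def spans_to_usfm(html: str) -> str:
    """Replace each superscripted verse number with a USFM \\v marker."""
    # Replacement: newline, a literal backslash and v, a space,
    # the captured verse number, and a trailing space.
    return VERSE_SPAN.sub(r"\n\\v \1 ", html)

sample = '<span class="T7">3</span>gwel jak ago'
print(spans_to_usfm(sample))
```

The printed result is a blank first line followed by \v 3 gwel jak ago, matching the hand-edited example above.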

If you want to do all of this editing in LibreOffice, you may need to change the file extension of the HTML to .txt to see the HTML codes.

With the sample files, we have a couple of other things to look at. First, the “front matter” is missing, so no one will know what book this is. Full documentation about USFM can be found here: http://ubsicap.github.io/usfm/identification/index.html, or you can look at a project for another language that has been saved from translationStudio or Autographa.

For Jude, the first lines should be:

\id jud Regular
\ide usfm
\h Jude
\toc1 Jude
\toc2 Jude
\toc3 jud
\mt Jude

Let’s look at this line-by-line.

\id jud Regular tells programs that this is the book of Jude, and that it’s an OL translation. It could also say \id jud ULB, or \id jud Tarangan.
\ide usfm tells programs that this is usfm, so they can decode it properly.
\h Jude is running header information. In this case, I would actually recommend using \h Yudas.
\toc1 is for the long form of the book name. In English, for example, we might put \toc1 The Epistle of James or \toc1 The Letter from James.
\toc2 is for the shorter form of the book name. \toc2 Yudas would be fine.
\toc3 is for an abbreviated name of the book. This is useful if you use a short form (Jhn 3:16) notation.
\mt is the title of the book as it’s printed at the top of the first page of the book. If you want to use multiple lines, you can use \mt1, \mt2, etc. In this case it should probably be \mt Salam Aban yudas.

Important note here: don’t change the book abbreviation in the first line: \id jud. This is the identifier for programs, and is based on English. All of the other places the book name appears, you can feel free to change it to the local name.
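If you are preparing several books, the front matter can be generated rather than typed. The sketch below is only an illustration: the function front_matter and its parameter names are hypothetical, and it simply reproduces the seven lines shown above.

```python
def front_matter(book_id, resource, header, long_name, short_name, abbrev, title):
    """Build the USFM identification lines for one book."""
    lines = [
        f"\\id {book_id} {resource}",  # keep book_id as the English-based code
        "\\ide usfm",
        f"\\h {header}",
        f"\\toc1 {long_name}",
        f"\\toc2 {short_name}",
        f"\\toc3 {abbrev}",
        f"\\mt {title}",
    ]
    return "\n".join(lines) + "\n"

# Reproduces the sample lines for Jude shown above.
print(front_matter("jud", "Regular", "Jude", "Jude", "Jude", "jud", "Jude"))
```

Every argument except book_id can be swapped for the local name, for example "Yudas" for the header.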

The file name should follow the pattern language code _ book code _ resource type _ project type. In this case, that gives tre_jud_text_reg.usfm. (Please understand that I don’t know which Tarangan language this book is in, so I picked one of the Tarangan language codes. Use the correct code for your language.)

When the translation includes additional lines for section headings, USFM handles them with a dedicated marker. (Headings are not part of the translation proper, since they do not come from the original Bible texts; they only help readers understand what they are about to read.) In this file, we have Allah On Aukum Dir-Dir Ago Daisago Sala. It should be on a line by itself, preceded by the \s marker and a space, to show that it is a section heading:

\s Allah On Aukum Dir-Dir Ago Daisago Sala

Finally, this file contains two books of the Bible: James and Philemon. They need to be split into separate files, one per book. Follow the same directions for Philemon that we have followed for James.

Make sure you check for verses that weren’t formatted correctly: on the first run through we were missing verses 1, 2, 5, 13, and 16. Verse 16 was missing altogether.
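A quick script can list the verse numbers that never received a \v marker. This is a sketch under stated assumptions: the function name missing_verses is made up, the file is assumed to hold a single one-chapter book, and you supply the expected number of verses yourself. The verse text in the sample is placeholder text.

```python
import re

def missing_verses(usfm: str, last_verse: int) -> list[int]:
    """Return the verse numbers from 1 to last_verse that have no \\v marker."""
    found = {int(n) for n in re.findall(r"\\v (\d+)", usfm)}
    return [n for n in range(1, last_verse + 1) if n not in found]

# Placeholder text: only verses 3, 4 and 6 were converted.
sample = "\\c 1\n\\v 3 text\n\\v 4 text\n\\v 6 text\n"
print(missing_verses(sample, 6))  # [1, 2, 5]
```

Any number it reports still has to be found and fixed by hand, and a verse that is absent from the source document, as verse 16 was here, has to go back to the translation team.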
