
Format conversion of localization files | 3txt to xliff

Submitted by: @import:stackexchange-codereview··

Problem

I'm trying to build an XLIFF file for localization from three files: one contains a list of IDs, another a list of source strings, and the last a list of translated strings.

Each file contains 200,000 strings, and the process takes a very long time. How can I speed up this loop?

I use sed to replace `<` and `>` with `&lt;` and `&gt;`. If you have better ideas, please tell me.

FILE_ID=$1
FILE_SOURCE=$2
FILE_TARGET=$3
TOT_STRING=$(wc -l "
echo ""
echo " "
echo "    "
echo "      "
echo "        Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo "        Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo "      "

    COUNTER=1

        while [  "$COUNTER" -le "$TOT_STRING" ]; do

        ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
        ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
        ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)

        if [ "$ROW_SOURCE" = "$ROW_TARGET" ]; then
            echo "      "
            echo "        $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
            echo "        "
            echo "      "
        else
            echo "      "
            echo "        $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
            echo "        $(echo $ROW_TARGET | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
            echo "      "
        fi

        COUNTER=$(( $COUNTER + 1 ))
        done

echo "       "
echo "    "
echo "  "
echo ""

exit

Solution

Improving speed

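For each line, the script runs 3 sed commands to extract the n-th line from the 3 files, and then a further 2-4 sed commands to escape the strings. That is of course slow: every sed call starts a new process, and each `sed -n Np` reads its file from the beginning, so the total work grows quadratically with the number of lines.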

My first recommendation would be to implement this in another scripting language, say, Python.

If you really want to do this in Bash, you could:

  • Combine the 3 files into one, with their lines interleaved. That is, take the 1st line from each file, then the 2nd from each file, and so on. Then, in each iteration of your loop, read 3 lines.

  • Instead of transforming `<` and `>` by running sed for each line, run sed just once for the entire input.

If we can assume that all 3 files have the same number of lines,
then you can create the input with lines interleaved like this:

paste -d '\n' "$FILE_ID" "$FILE_SOURCE" "$FILE_TARGET"
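
To see what the interleaved stream looks like, here is a small sketch on three throwaway files. The `interleave` helper, the file names, and the sample contents are all made up for illustration.

```shell
# Illustration only: interleave, the file names and their contents are hypothetical.
interleave() {
    # Emit line 1 of each file, then line 2 of each file, and so on.
    paste -d '\n' "$1" "$2" "$3"
}

tmpdir=$(mktemp -d)
printf '%s\n' id1 id2 > "$tmpdir/ids.txt"
printf '%s\n' Hello Bye > "$tmpdir/source.txt"
printf '%s\n' Bonjour Salut > "$tmpdir/target.txt"

# Prints id1, Hello, Bonjour, id2, Bye, Salut, one per line.
interleave "$tmpdir/ids.txt" "$tmpdir/source.txt" "$tmpdir/target.txt"
rm -r "$tmpdir"
```

If the files differ in length, paste pads the shorter ones with empty lines, so it may be worth checking the line counts first.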


When replacing `<` and `>` using sed, you can do it with a single sed command using multiple -e flags, like this:

sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
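
As a quick check of the escaping step, the sketch below wraps it in a function and feeds it one sample line. The name `xml_escape` is hypothetical, and escaping `&` is an addition that the original script does not do; it is included on the assumption that the output should be well-formed XML.

```shell
# Escape XML special characters in everything read from stdin.
# Assumption: handling "&" goes beyond the original script, which only
# dealt with "<" and ">". It runs first so the "&" in "&lt;" is not re-escaped.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Prints: a &lt; b &amp;&amp; c &gt; d
printf '%s\n' 'a < b && c > d' | xml_escape
```

Because the function reads stdin, it can sit in the middle of a pipeline and process the whole input with a single sed process.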


Putting it together:

paste -d '\n' "$FILE_ID" "$FILE_SOURCE" "$FILE_TARGET" | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' | \
for ((COUNTER = 1; COUNTER <= TOT_STRING; ++COUNTER)); do
    # -r keeps backslashes in the strings intact
    read -r ROW_ID
    read -r ROW_SOURCE
    read -r ROW_TARGET
    # ...
done
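
One way the loop body could be filled in is sketched below. The markup of the original script is not visible here, so the `trans-unit`, `source` and `target` element names (taken from XLIFF 1.2) and the empty target for identical strings are assumptions, as is the function name `emit_units`. The loop stops when the input runs out, so it does not need the line count.

```shell
# Sketch: emit one unit per (id, source, target) triple in a single pass.
# Assumptions: the element names follow XLIFF 1.2 and an empty <target>
# is written when both strings match; adjust both to the real template.
emit_units() {
    paste -d '\n' "$1" "$2" "$3" |
    sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' |
    while IFS= read -r row_id && IFS= read -r row_source && IFS= read -r row_target; do
        printf '      <trans-unit id="%s">\n' "$row_id"
        printf '        <source>%s</source>\n' "$row_source"
        if [ "$row_source" = "$row_target" ]; then
            printf '        <target></target>\n'
        else
            printf '        <target>%s</target>\n' "$row_target"
        fi
        printf '      </trans-unit>\n'
    done
}
```

It would be called as `emit_units "$FILE_ID" "$FILE_SOURCE" "$FILE_TARGET"`, between the commands that print the header and the footer. The whole conversion then uses one paste and one sed process rather than several per line.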


Simplify

This is unnecessarily complicated:

ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)


You can write this more simply:

ROW_ID=$(sed -n ${COUNTER}p $FILE_ID)
ROW_SOURCE=$(sed -n ${COUNTER}p $FILE_SOURCE)
ROW_TARGET=$(sed -n ${COUNTER}p $FILE_TARGET)


Counting loops in Bash

Instead of this:

COUNTER=1
while [ "$COUNTER" -le "$TOT_STRING" ]; do
    # do something
    COUNTER=$(( $COUNTER + 1 ))
done


This is equivalent, but cleaner and simpler:

for ((COUNTER = 1; COUNTER <= TOT_STRING; ++COUNTER)); do
    # do something
done


Naming

TOT_STRING is a strange name for a variable with an integer value.

Printing large text

Instead of this:

echo ""
echo ""
echo " "
echo "    "
echo "      "
echo "        Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo "        Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo "      "


A simpler way to write this is with a here-document:

cat << EOF

 
    
      
        Project-Id-Version: 1.0
 Report-Msgid-Bugs-To: email@example.com
POT-Creation-Date: $time+0200
PO-Revision-Date: $time+0200
Last-Translator: JohnnyKing
Language-Team: JohnnyKing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Generator: csv2xliff.sh

        Project-Id-Version: 1.0
 Report-Msgid-Bugs-To: email@example.com
POT-Creation-Date: $time+0200
PO-Revision-Date: $time+0200
Last-Translator: JohnnyKing
Language-Team: JohnnyKing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Generator: csv2xliff.sh

      
EOF
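
The here-document approach can be sketched as a small function. The name `print_header`, its argument, and the sample timestamp are hypothetical, and only three of the header lines are reproduced to keep the example short.

```shell
# Sketch of a header function built on a here-document.
# Assumption: print_header and its argument are hypothetical names; the
# timestamp is passed in so the function does not rely on a global $time.
print_header() {
    header_time=$1
    # Unquoted delimiter: $header_time is expanded inside the body.
    cat << EOF
POT-Creation-Date: $header_time+0200
PO-Revision-Date: $header_time+0200
X-Generator: csv2xliff.sh
EOF
}

# The timestamp below is a made-up sample value.
print_header "2015-12-05 10:00"
```

With a quoted delimiter (`<< 'EOF'`) the body is copied literally and `$header_time` is not expanded, which is useful when the text itself contains dollar signs.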


Indentation

The indentation here is odd:

echo "X-Generator: csv2xliff.sh"
echo ""
echo "      "

    COUNTER=1

        while [  "$COUNTER" -le "$TOT_STRING" ]; do

        ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
        ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
        ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)

        if [ "$ROW_SOURCE" = "$ROW_TARGET" ]; then
            echo "      "
            echo "        $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
            echo "        "
            echo "      "
        else
            echo "      "
            echo "        $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
            echo "        $(echo $ROW_TARGET | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
            echo "      "
        fi

        COUNTER=$(( $COUNTER + 1 ))
        done

echo "       "
echo "    "
echo "  "
echo ""


It would be more natural this way:

echo "X-Generator: csv2xliff.sh"
echo ""
echo "      "

COUNTER=1

while [ "$COUNTER" -le "$TOT_STRING" ]; do
    ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
    ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
    ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)

    if [ "$ROW_SOURCE" = "$ROW_TARGET" ]; then
        echo "      "
        echo "        $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
        echo "        "
        echo "      "
    else
        echo "      "
        echo "        $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
        echo "        $(echo $ROW_TARGET | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
        echo "      "
    fi

    COUNTER=$(( $COUNTER + 1 ))
done

echo "       "
e

Code Snippets

paste -d '\n' "$FILE_ID" "$FILE_SOURCE" "$FILE_TARGET"
sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
paste -d '\n' "$FILE_ID" "$FILE_SOURCE" "$FILE_TARGET" | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' | \
for ((COUNTER = 1; COUNTER <= TOT_STRING; ++COUNTER)); do
    read ROW_ID
    read ROW_SOURCE
    read ROW_TARGET
    # ...
done
ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)
ROW_ID=$(sed -n ${COUNTER}p $FILE_ID)
ROW_SOURCE=$(sed -n ${COUNTER}p $FILE_SOURCE)
ROW_TARGET=$(sed -n ${COUNTER}p $FILE_TARGET)

Context

StackExchange Code Review Q#113000, answer score: 2
