Format conversion of localization files | 3txt to xliff
Problem
I'm trying to build an XLIFF file for localization from 3 specific files: one contains a list of IDs, another a list of source strings, and the last a list of translated strings.
Each file contains 200,000 strings, and the process takes far too long. How can I speed up this loop?
I use sed to replace `<` and `>` with `&lt;` and `&gt;`. If you have better ideas, please tell me.
FILE_ID=$1
FILE_SOURCE=$2
FILE_TARGET=$3
TOT_STRING=$(wc -l < "$FILE_ID")
echo ""
echo " "
echo " "
echo " "
echo " Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo " Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo " "
COUNTER=1
while [ "$COUNTER" -le "$TOT_STRING" ]; do
ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)
if [ "$ROW_SOURCE" = "$ROW_TARGET" ]; then
echo " "
echo " $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
echo " "
echo " "
else
echo " "
echo " $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
echo " $(echo $ROW_TARGET | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
echo " "
fi
COUNTER=$(( $COUNTER + 1 ))
done
echo " "
echo " "
echo " "
echo ""
exit
Solution
Improving speed
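For each line, running 3 `sed` commands to extract the n-th line from 3 files, and then running a further 2-4 `sed` commands, is of course slow. Every one of those calls starts a new process, and each `sed -n Np` reads its file again from the top, so the total work grows roughly with the square of the line count.
My first recommendation would be to implement this in another scripting language, say, Python.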
If you really want to do this in Bash, you could:
- Combine the 3 files into one, with their lines interleaved. That is, take the 1st line from each file, then the 2nd from each file, and so on. And then in each iteration of your loop, read 3 lines.
- Instead of transforming the `<` and `>` characters by running `sed` for each line, run `sed` just once for the entire input.
If we can assume that all 3 files have the same number of lines,
then you can create the input with lines interleaved like this:
paste -d '\n' "$FILE_ID" "$FILE_SOURCE" "$FILE_TARGET"
When replacing `<` and `>` using sed, you can do it with a single sed command using multiple -e flags, like this:
sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
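As a small illustration of the single-command form, the sketch below wraps it in a function. The name `xml_escape` is hypothetical, and the extra `&` rule is an assumption on my part: the original script only handles `<` and `>`, but XLIFF is XML, so a bare `&` would also need escaping. That rule has to come first so it does not re-escape the entities produced by the other two.

```bash
#!/usr/bin/env bash
# xml_escape is a hypothetical helper name, not part of the original script.
# It reads text on stdin and escapes it with a single sed process.
xml_escape() {
    # '&' is handled first (an assumption beyond the original script) so that
    # the '&' in the generated '&lt;' and '&gt;' is not escaped a second time.
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

printf '%s\n' 'a < b && c > d' | xml_escape   # prints: a &lt; b &amp;&amp; c &gt; d
```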
Putting it together:
paste -d '\n' "$FILE_ID" "$FILE_SOURCE" "$FILE_TARGET" | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' | \
for ((COUNTER = 1; COUNTER <= TOT_STRING; ++COUNTER)); do
    # -r keeps backslashes in the input intact
    read -r ROW_ID
    read -r ROW_SOURCE
    read -r ROW_TARGET
    # ...
done
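Here is a fuller sketch of that loop, filled in under stated assumptions. The XML tags of the original script are not available here, so the `<trans-unit>`, `<source>` and `<target>` elements below are assumed from the XLIFF format rather than taken from the script, as is the empty target for identical strings. `emit_units` is a hypothetical function name. It uses `while read` instead of a counter, so the line count is not needed at all.

```bash
#!/usr/bin/env bash
# emit_units is a hypothetical name; the element names are assumed from XLIFF,
# not recovered from the original script.
emit_units() {
    local file_id=$1 file_source=$2 file_target=$3
    local row_id row_source row_target

    # Interleave the files (id, source, target, id, ...), escape everything in
    # one sed process, then consume three lines per iteration.
    paste -d '\n' "$file_id" "$file_source" "$file_target" |
        sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' |
        while IFS= read -r row_id &&
              IFS= read -r row_source &&
              IFS= read -r row_target; do
            printf '<trans-unit id="%s">\n' "$row_id"
            printf '  <source>%s</source>\n' "$row_source"
            # Assumption: an untranslated string (target equal to source)
            # gets an empty target element.
            if [ "$row_source" = "$row_target" ]; then
                printf '  <target></target>\n'
            else
                printf '  <target>%s</target>\n' "$row_target"
            fi
            printf '</trans-unit>\n'
        done
}
```

It could be called as `emit_units ids.txt source.txt target.txt > units.xml`, where the file names are placeholders. Only three processes run for the whole input (`paste`, `sed` and the loop's subshell), instead of several per line.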
Simplify
This is unnecessarily complicated:
ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)
You can write it more simply:
ROW_ID=$(sed -n ${COUNTER}p $FILE_ID)
ROW_SOURCE=$(sed -n ${COUNTER}p $FILE_SOURCE)
ROW_TARGET=$(sed -n ${COUNTER}p $FILE_TARGET)
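If per-line extraction is kept at all, two further refinements are possible. The sketch below assumes a helper named `nth_line` (hypothetical): quoting the file name protects paths containing spaces, and the `q` command makes `sed` stop at the requested line instead of reading the rest of the file.

```bash
#!/usr/bin/env bash
# nth_line is a hypothetical helper: print line number $1 of file $2.
nth_line() {
    # {p;q;} prints the addressed line and quits immediately, so sed does not
    # scan the remainder of the file.
    sed -n "${1}{p;q;}" "$2"
}

sample=$(mktemp)
printf '%s\n' alpha beta gamma > "$sample"
nth_line 2 "$sample"   # prints: beta
rm -f "$sample"
```

This still costs one process per extracted line, so it is no substitute for the single-pass approach above.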
Counting loops in Bash
Instead of this:
COUNTER=1
while [ "$COUNTER" -le "$TOT_STRING" ]; do
# do something
COUNTER=$(( $COUNTER + 1 ))
done
This is equivalent, but cleaner and simpler:
for ((COUNTER = 1; COUNTER <= TOT_STRING; ++COUNTER)); do
# do something
done
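A quick way to check that the two forms agree is to run both over the same bound. This is only a sketch: `LIMIT` and the two function names are made up for the demonstration. Note that the `for ((...))` form is a Bash feature and is not available in plain POSIX sh.

```bash
#!/usr/bin/env bash
LIMIT=5   # hypothetical bound standing in for TOT_STRING

count_with_while() {
    local counter=1
    while [ "$counter" -le "$LIMIT" ]; do
        printf '%s ' "$counter"
        counter=$(( counter + 1 ))
    done
}

count_with_for() {
    local counter
    for ((counter = 1; counter <= LIMIT; ++counter)); do
        printf '%s ' "$counter"
    done
}

count_with_while; echo   # prints: 1 2 3 4 5
count_with_for; echo     # prints: 1 2 3 4 5
```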
Naming
TOT_STRING is a strange name for a variable with an integer value.
Printing large text
Instead of this:
echo ""
echo ""
echo " "
echo " "
echo " "
echo " Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo " Project-Id-Version: 1.0"
echo " Report-Msgid-Bugs-To: email@example.com"
echo "POT-Creation-Date: $time+0200"
echo "PO-Revision-Date: $time+0200"
echo "Last-Translator: JohnnyKing"
echo "Language-Team: JohnnyKing"
echo "MIME-Version: 1.0"
echo "Content-Type: text/plain; charset=UTF-8"
echo "Content-Transfer-Encoding: 8bit"
echo "X-Generator: csv2xliff.sh"
echo ""
echo " "
A simpler way to write this is with a here-document:
cat <<EOF
Project-Id-Version: 1.0
Report-Msgid-Bugs-To: email@example.com
POT-Creation-Date: $time+0200
PO-Revision-Date: $time+0200
Last-Translator: JohnnyKing
Language-Team: JohnnyKing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Generator: csv2xliff.sh
Project-Id-Version: 1.0
Report-Msgid-Bugs-To: email@example.com
POT-Creation-Date: $time+0200
PO-Revision-Date: $time+0200
Last-Translator: JohnnyKing
Language-Team: JohnnyKing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Generator: csv2xliff.sh
EOF
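One detail worth knowing about here-documents: with an unquoted delimiter (`EOF`), variables such as `$time` are expanded, just as they are inside the double-quoted `echo` strings. Quoting the delimiter (`'EOF'`) turns expansion off. A minimal sketch, with a made-up value for `time` and hypothetical function names:

```bash
#!/usr/bin/env bash
time='2015-12-05 10:00'   # made-up value for illustration

print_expanded() {
    # Unquoted delimiter: $time is substituted.
    cat <<EOF
POT-Creation-Date: $time+0200
EOF
}

print_literal() {
    # Quoted delimiter: the text is emitted verbatim.
    cat <<'EOF'
POT-Creation-Date: $time+0200
EOF
}

print_expanded   # prints: POT-Creation-Date: 2015-12-05 10:00+0200
print_literal    # prints: POT-Creation-Date: $time+0200
```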
Indentation
The indentation here is odd:
echo "X-Generator: csv2xliff.sh"
echo ""
echo " "
COUNTER=1
while [ "$COUNTER" -le "$TOT_STRING" ]; do
ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)
if [ "$ROW_SOURCE" = "$ROW_TARGET" ]; then
echo " "
echo " $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
echo " "
echo " "
else
echo " "
echo " $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
echo " $(echo $ROW_TARGET | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
echo " "
fi
COUNTER=$(( $COUNTER + 1 ))
done
echo " "
echo " "
echo " "
echo ""
It would be more natural this way:
echo "X-Generator: csv2xliff.sh"
echo ""
echo " "
COUNTER=1
while [ "$COUNTER" -le "$TOT_STRING" ]; do
    ROW_ID=$(sed -n $(( $COUNTER ))p $FILE_ID)
    ROW_SOURCE=$(sed -n $(( COUNTER ))p $FILE_SOURCE)
    ROW_TARGET=$(sed -n $(( COUNTER ))p $FILE_TARGET)
    if [ "$ROW_SOURCE" = "$ROW_TARGET" ]; then
        echo " "
        echo " $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
        echo " "
        echo " "
    else
        echo " "
        echo " $(echo $ROW_SOURCE | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
        echo " $(echo $ROW_TARGET | sed 's/</\&lt;/g' | sed 's/>/\&gt;/g')"
        echo " "
    fi
    COUNTER=$(( $COUNTER + 1 ))
done
echo " "
echo " "
echo " "
echo ""
Context
StackExchange Code Review Q#113000, answer score: 2