
How to represent text for a program to add punctuation to a block of text?

Submitted by: @import:stackexchange-cs

Problem

I want to write a program that takes a block of text with the punctuation removed and puts the punctuation back in the correct places.
I have started with a simpler version of the problem, working with just text and spaces: no commas, full stops, etc.
My first approach has been machine learning: neural networks and simpler word-count probabilities.
However, what I am struggling with is how to represent the block of text. At the moment I give the program a block of text; it reads, say, 20 letters and then splits them based on where it thinks the words are. This is ineffective, though, because the 20-letter window may cut a word in half at the start or end of the block.
Does anyone have an idea for a better way to represent the block of text for this task?
Many thanks

Solution

I am making two assumptions here about things that, to me, are not quite clear from your question:

  • When you say you removed punctuation, you meant that you removed whitespace as well.

  • You tried neural networks, but not in a sequence-to-sequence fashion.

What's nice about your problem is that you can generate as much training data as you want, just by taking a properly punctuated piece of text and removing all the punctuation.
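To make the data-generation idea concrete, here is a minimal sketch in Python. The function names `strip_punctuation` and `make_pairs` are made up for illustration, and the sketch assumes that "punctuation" means ASCII punctuation only (so apostrophes inside words are removed too). The `strip_spaces` flag covers the case where whitespace is removed as well.

```python
import string

# Assumption: ASCII punctuation only. Extend this set if your corpus
# contains curly quotes, dashes, or other non-ASCII punctuation.
PUNCTUATION = set(string.punctuation)


def strip_punctuation(text, strip_spaces=False):
    """Return text without punctuation, optionally without whitespace too."""
    without_punct = "".join(ch for ch in text if ch not in PUNCTUATION)
    if strip_spaces:
        # split() with no arguments drops every run of whitespace
        return "".join(without_punct.split())
    return without_punct


def make_pairs(texts, strip_spaces=False):
    """Build (source, target) pairs: stripped text in, original text out."""
    return [(strip_punctuation(t, strip_spaces), t) for t in texts]


if __name__ == "__main__":
    for source, target in make_pairs(["Hello, world. How are you?"]):
        print(repr(source), "->", repr(target))
```

Every properly punctuated text you can find becomes one training pair this way, so the size of the training set is limited only by how much text you collect.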

That being said, I suggest you approach your problem as a machine translation task, where you want to predict the properly punctuated text given the text without punctuation.

Today, the best way to do this is sequence-to-sequence learning with deep neural networks (just pick a few machine translation papers from the most recent ACL (Annual Meeting of the Association for Computational Linguistics) conference to learn more about that).
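Whichever sequence-to-sequence model you end up with, it will consume sequences of integer IDs rather than raw strings. A character-level encoding is one possible representation; the sketch below is an assumption on my part, not something a particular paper prescribes. The names `build_vocab` and `encode`, and the choice of 0 for padding and 1 for unknown characters, are purely illustrative.

```python
PAD_ID = 0  # assumed convention: 0 fills up sequences that are too short
UNK_ID = 1  # assumed convention: 1 stands for characters not in the vocabulary


def build_vocab(texts):
    """Map every character seen in texts to an integer ID, starting at 2."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: index + 2 for index, ch in enumerate(chars)}


def encode(text, vocab, length):
    """Turn text into exactly `length` IDs, truncating or padding as needed."""
    ids = [vocab.get(ch, UNK_ID) for ch in text[:length]]
    return ids + [PAD_ID] * (length - len(ids))


if __name__ == "__main__":
    vocab = build_vocab(["ab", "ba c"])
    print(vocab)
    print(encode("abz", vocab, 5))
```

Building the vocabulary from the punctuated texts means the same mapping can encode both the stripped input and the punctuated output.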

Concerning your actual question, how to represent the input: I'm not quite sure whether you have to limit the size of your input beforehand or not. Either way, you could choose the size of your input window so that any block of text you might want to feed in would fit (if you know beforehand how long your paragraphs will be). Or it might be that you can just cut your input sequence at an arbitrary letter and your model will learn to make sense of it anyway (I am not an expert in neural machine translation).
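Both options can be sketched with one small helper. The function `windows` below is hypothetical: it cuts a text into fixed-size character windows at arbitrary positions and pads the last one. The optional `stride` parameter is my own addition rather than something the question asked for; with a stride smaller than the window size, consecutive windows overlap, so a word cut in half at the edge of one window may appear whole in the next.

```python
from typing import List, Optional


def windows(text: str, size: int, stride: Optional[int] = None,
            pad: str = "\0") -> List[str]:
    """Cut text into windows of exactly `size` characters.

    stride defaults to size (no overlap). A smaller stride makes
    consecutive windows overlap by size - stride characters.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if stride is None:
        stride = size
    if not 0 < stride <= size:
        raise ValueError("stride must be between 1 and size")

    result = []
    start = 0
    while True:
        chunk = text[start:start + size]
        # Pad the final (or only) window so every window has the same length
        result.append(chunk.ljust(size, pad))
        if start + size >= len(text):
            break
        start += stride
    return result


if __name__ == "__main__":
    print(windows("thequickbrownfox", size=8, stride=4, pad="_"))
```

If you know the maximum paragraph length in advance, setting `size` to that length gives exactly one padded window per paragraph, which corresponds to the first option above.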

So, as you can see, my answer may not be completely on point, but I hope it is helpful to you anyway.

Context

StackExchange Computer Science Q#69175, answer score: 2
