I have code that extracts specific data from a PDF that has already been converted to a string. This is the string I get after the conversion:
JavaScript
Valor del Fondo (Mill COP)
1,308,906.95
Valor fondo de
inversión (Mill COP)
230,942.51 Inversión inicial mínima (COP)
I need a regular expression that captures only the numbers. I expect something like this: [1308906.95, 230942.51]
This is my Node.js code:
JavaScript
const fs = require('fs');
const pdfparse = require('pdf-parse');

const pdffile = fs.readFileSync('testdoc3.pdf');

pdfparse(pdffile).then(function (data) {
    var myre = /(V|v)alor\s(del)?(\s)?(fondo)(\s)?(de)?(\s)?(inversi(ó|o)n)?/gim;
    var array = myre.exec(data.text);
    console.log(array[0]);
});
This is the code I have so far. I have tried a lot of variations, so I would really appreciate your help. Thanks.
Answer
You can use
JavaScript
const text = 'Valor del Fondo (Mill COP)\n1,308,906.95\nValor fondo de\ninversión (Mill COP)\n\n 230,942.51 Inversión inicial mínima (COP)';
console.log(
  Array.from(text.matchAll(
    /valor(?:\s+del)?\s+fondo(?:\s+de\s+inversi[óo]n)?\D*(\d(?:[.,\d]*\d)?)/gi),
    x=>x[1])
  .map(x => x.replace(/,/g, ''))
);
See the regex demo. Regex details:
valor – a "valor" string
(?:\s+del)? – an optional sequence of one or more whitespaces and then "del"
\s+ – one or more whitespaces
fondo – a fixed string
(?:\s+de\s+inversi[óo]n)? – an optional sequence of one or more whitespaces, "de", one or more whitespaces, and then "inversión" or "inversion"
\D* – zero or more non-digit chars
(\d(?:[.,\d]*\d)?) – Group 1: a digit and then an optional sequence of zero or more digits, commas or dots and then a digit.
String#matchAll finds all non-overlapping occurrences, Array.from(..., x=>x[1]) gets the Group 1 values, and .map(x => x.replace(/,/g, '')) removes commas from the values obtained.
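The snippet above yields the values as strings, while the question asks for numbers. A minimal sketch of one way to get numbers is to wrap the same regex in a small helper and convert each match with Number. The name extractFundValues is hypothetical, not part of the original answer, and the sketch assumes that commas are always thousands separators and the dot is the decimal mark, as in the sample text.

```javascript
// Hypothetical helper (not from the original answer): returns the captured
// fund values as numbers instead of strings.
// Assumption: "," is a thousands separator and "." is the decimal mark.
function extractFundValues(text) {
  const re = /valor(?:\s+del)?\s+fondo(?:\s+de\s+inversi[óo]n)?\D*(\d(?:[.,\d]*\d)?)/gi;
  // matchAll requires the g flag; m[1] is capture group 1 of each match.
  return Array.from(text.matchAll(re), (m) => Number(m[1].replace(/,/g, '')));
}

// Sample text taken from the question, joined with newlines.
const sample = [
  'Valor del Fondo (Mill COP)',
  '1,308,906.95',
  'Valor fondo de',
  'inversión (Mill COP)',
  '',
  ' 230,942.51 Inversión inicial mínima (COP)',
].join('\n');

console.log(extractFundValues(sample));
```

With the code from the question, the helper could presumably be called inside the pdfparse(pdffile).then(...) callback as extractFundValues(data.text). If no label matches, it returns an empty array, which avoids the null check that myre.exec would need.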