Regular expression to capture PDF data in Node.js

I have this code to get specific data from a PDF that has already been converted to a string. This is the string I have after the conversion:

Valor del Fondo (Mill COP)
1,308,906.95
Valor fondo de
inversión  (Mill COP)
                           230,942.51 Inversión inicial mínima (COP)

I need a regular expression that captures only the numbers. I expect something like this: [1308906.95, 230942.51]

This is my Node.js code:

const fs = require('fs');
const pdfparse = require('pdf-parse');

const pdffile = fs.readFileSync('testdoc3.pdf');

pdfparse(pdffile).then(function (data) {
   var myre = /(V|v)alor\s(del)?(\s)?(fondo)(\s)?(de)?(\s)?(inversi(ó|o)n)?/gim
   var array = myre.exec(data.text);
   console.log(array[0]);
});

This is the code I have so far. I would really appreciate your help, since I have tried a lot. Thanks.

Answer

You can use

const text = 'Valor del Fondo (Mill COP)\n1,308,906.95\nValor fondo de\ninversión  (Mill COP)\n\n                          230,942.51 Inversión inicial mínima (COP)';
console.log(
  Array.from(text.matchAll(
    /valor(?:\s+del)?\s+fondo(?:\s+de\s+inversi[óo]n)?\D*(\d(?:[.,\d]*\d)?)/gi),
    x=>x[1])
  .map(x => x.replace(/,/g, ''))
);
// => [ '1308906.95', '230942.51' ]

Regex details:

  • valor – the literal string valor
  • (?:\s+del)? – an optional sequence of one or more whitespaces and then del
  • \s+ – one or more whitespaces
  • fondo – the literal string fondo
  • (?:\s+de\s+inversi[óo]n)? – an optional sequence of one or more whitespaces, de, one or more whitespaces, and then inversión or inversion
  • \D* – zero or more non-digit chars
  • (\d(?:[.,\d]*\d)?) – Group 1: a digit and then an optional sequence of zero or more digits, commas or dots followed by a digit, so the captured value always ends with a digit.
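To see how the pieces above interact, here is a small sketch that runs the same pattern (without the g flag) against a few inputs. The first two samples come from the question's text; the last two are made-up strings added only to illustrate a single-digit value and a trailing dot.

```javascript
// Same pattern as in the answer; no `g` flag, so exec() always starts at index 0.
const labelAndNumber =
  /valor(?:\s+del)?\s+fondo(?:\s+de\s+inversi[óo]n)?\D*(\d(?:[.,\d]*\d)?)/i;

const samples = [
  'Valor del Fondo (Mill COP)\n1,308,906.95',                         // "del" variant
  'Valor fondo de\ninversión  (Mill COP)\n\n   230,942.51 Inversión', // "de inversión" variant
  'valor fondo: 7',            // hypothetical: single digit, no separators
  'Valor del Fondo 12,345.',   // hypothetical: trailing dot is not captured
];

// Collect Group 1 for each sample, or null when the label is not found.
const captured = samples.map(s => {
  const m = labelAndNumber.exec(s);
  return m ? m[1] : null;
});

console.log(captured);
// [ '1,308,906.95', '230,942.51', '7', '12,345' ]
```

Because Group 1 must start and end with a digit, punctuation that follows the number stays out of the capture.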

String#matchAll finds all non-overlapping occurrences, Array.from(..., x=>x[1]) gets the Group 1 values, and .map(x => x.replace(/,/g, '')) removes the commas from the values obtained, leaving an array of numeric strings.
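The question asks for numbers rather than strings, so one possible follow-up is to wrap the steps in a function and convert each value with Number. This is only a sketch: extractFundValues is a hypothetical name, and it assumes commas are always thousands separators and the dot is the decimal mark, as in the sample text.

```javascript
// Hypothetical helper (not part of pdf-parse): returns the fund values as numbers.
// Assumes "," is a thousands separator and "." is the decimal mark.
function extractFundValues(text) {
  const re =
    /valor(?:\s+del)?\s+fondo(?:\s+de\s+inversi[óo]n)?\D*(\d(?:[.,\d]*\d)?)/gi;
  return Array.from(text.matchAll(re), m => m[1]) // Group 1 of every match
    .map(s => Number(s.replace(/,/g, '')));       // drop commas, then convert
}

const sample =
  'Valor del Fondo (Mill COP)\n1,308,906.95\nValor fondo de\n' +
  'inversión  (Mill COP)\n\n      230,942.51 Inversión inicial mínima (COP)';

console.log(extractFundValues(sample));
// [ 1308906.95, 230942.51 ]
```

In the question's code, the call would go inside the pdfparse(pdffile).then(...) callback, with data.text passed in place of sample.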
