How to pass an input File object to a web worker so it can read slices from a file (JavaScript)

So I create an input element using:

var s_curFile;

function JSprocessFilePicker( input ) {
    let url = input.value;
    let ext = url.substring( url.lastIndexOf( '.' ) + 1 ).toLowerCase();
    if ( input.files && input.files[0] && ( ext == "txt" ) ) {
        s_curFile = input.files[0];

        //TODO send s_curFile to workers
    }
}

var input = document.createElement( "input" );
input.setAttribute( "id", "file_picker" );
input.setAttribute( "type", "file" );
input.setAttribute( "accept", ".txt" );
input.setAttribute( "onchange", "JSprocessFilePicker(this)" );

I want to send s_curFile to a web worker so that I can read slices from it on both the main thread and the worker at the same time using XMLHttpRequest, like:

//on both worker and main thread
let xhrReq = new XMLHttpRequest();
xhrReq.overrideMimeType('text/plain; charset=x-user-defined');
//qwOffset and hSize are determined on the thread
let uri = URL.createObjectURL(s_curFile.slice(qwOffset, qwOffset + hSize));
xhrReq.open('GET', uri, false); //can i make it async on workers?
xhrReq.send();
let Idx;
let sz = xhrReq.response.length;
for (Idx = 0; Idx < sz; ++Idx) {
    //do stuff with response
}

I am only reading the file. So, how would I go about sending s_curFile to the worker so I can do that? I would think you would have to use .postMessage(...) from the main thread to the worker using a SharedArrayBuffer, but how would I populate the buffer? Or is there another means to do it, because I am fairly certain XMLHttpRequest can be done from the worker. (I need this functionality because the size of the local file the user can have is upward of 30 GB, so I can’t have it all in memory due to the per tab memory limitations, and I want the workers to help in processing the sheer amount of data)


You can simply postMessage() your File object. The underlying data will not be copied over, only the wrapper object.

However, note that for reading a File you should not use XMLHttpRequest. In older browsers you'd use a FileReader (or even FileReaderSync in Web Workers) and its .readAsText() method. In recent browsers you'd use the File's .text() method, which returns a Promise resolving with the content read as UTF-8 text.

However, to read a text file in chunks, you need to handle multi-byte characters. Slicing such a character in the middle will break it:

(async () => {
  const file = new File(["😱"], "file.txt");
  const chunk1 = file.slice(0, file.size/2);
  const chunk2 = file.slice(file.size/2);
  const txt1 = await chunk1.text();
  const txt2 = await chunk2.text();
  const all  = await file.text();
  console.log({txt1, txt2, all});
})();

To circumvent that, you need to use a TextDecoder, which can keep the last incomplete bytes in memory and reconstruct the proper character on the next call, thanks to the stream option of its .decode() method.

(async () => {
  const file = new File(["😱"], "file.txt");
  const decoder = new TextDecoder();
  const chunk1 = file.slice(0, file.size/2);
  const chunk2 = file.slice(file.size/2);
  const txt1 = decoder.decode(await chunk1.arrayBuffer(), { stream: true});
  const txt2 = decoder.decode(await chunk2.arrayBuffer(), { stream: true});
  const all  = await file.text();
  // now txt1 is empty and txt2 contains the whole glyph
  console.log({txt1, txt2, all});
})();

But TextDecoder instances can't be shared across Workers, so they won't really help with the chunking issue you may face when splitting your file across different Workers. I'm unfortunately not aware of an easy solution for this case, so it's your call whether the speed gain is worth the risk of breaking a few characters; I know that in my area of the globe the risk can't be taken, because most characters are affected.
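If you do need correctness at the boundaries, one possible direction (my own sketch, not part of the answer above, and assuming the file is valid UTF-8) is to nudge each chunk boundary back to the start of a character before dispatching the slices: UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so a character's lead byte can be found by walking backwards over a few bytes read around each boundary:

```javascript
// Hypothetical boundary realignment for valid UTF-8 input.
// Continuation bytes are 0b10xxxxxx (0x80-0xBF); anything else
// starts a new character, so it is a safe place to cut.
function alignToCharStart(bytes, index) {
  while (index > 0 && (bytes[index] & 0b11000000) === 0b10000000) {
    index--; // still inside a multi-byte character, step back
  }
  return index;
}

const bytes = new TextEncoder().encode("a😱b"); // 0x61, F0 9F 98 B1, 0x62
console.log(alignToCharStart(bytes, 3)); // 3 is inside the emoji → 1
console.log(alignToCharStart(bytes, 5)); // "b" starts a character → 5
```

With boundaries aligned this way, each Worker decodes a whole number of characters and no decoder state has to cross threads; it costs one small extra read per boundary on the main thread.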

Anyway, here is a solution that does take this risk: it will split your file into as many chunks as there are available CPU cores, each Worker processing its own chunk as a stream and returning the number of "A"s it found.

const inp = document.querySelector("input");
// limit our number of parallel Workers to the number of cores - 1 (for UI)
const availableThreads = navigator.hardwareConcurrency - 1;
const workerUrl = buildWorkerURL();
const workers = Array.from({length: availableThreads}, () => new Worker(workerUrl));

inp.addEventListener("change", async (evt) => {
  const file = inp.files[0];
  if (!file.name.endsWith(".txt")) {
    console.log("not a .txt file");
    return;
  }
  const chunkSize = Math.ceil(file.size / workers.length);
  const numberOfAs = (await Promise.all(workers.map((worker, i) => {
    return new Promise((res, rej) => {
      // we use a MessageChannel to be able to promisify the request to the Worker
      // this way we can handle different parallel requests
      const { port1, port2 } = new MessageChannel();
      worker.onerror = rej;
      port2.onmessage = ({data}) => {
        if (isNaN(data)) {
          // You could handle progress events here if you wish
          rej(data);
        }
        res(data);
      };
      // we send only a chunk for convenience
      // the actual data never moves anyway
      const chunk = file.slice(chunkSize * i, chunkSize * (i + 1));
      worker.postMessage(chunk, [port1]);
    });
  })))
    // each worker sent its own count, we have to do the sum here
    .reduce((a, b) => a + b, 0);
  console.log(`The file ${file.name} contains ${numberOfAs} "A"s`);
});

function buildWorkerURL() {
  const scriptContent = document.querySelector("script[type=worker]").textContent;
  const blob = new Blob([scriptContent], {type: "text/javascript"});
  return URL.createObjectURL(blob);
}
<input type=file>
<!-- our worker script -->
<script type=worker>
  onmessage = ({data, ports}) => {
    let found = 0;
    const stream = data.stream();
    const reader = stream.getReader();
    const decoder = new TextDecoder();
    reader.read().then(processChunk);
    function processChunk({done, value}) {
      // 'value' is an Uint8Array
      // we decode it as UTF-8 text, with the 'stream' option
      const chunk = decoder.decode(value, { stream: true });
      // do some processing over the chunk of text
      // be careful to NOT leak the data here
      found += (chunk.match(/(a|A)/g)||"").length;
      if (done) {
        // use the sent MessagePort to be able to "promisify"
        // the whole process
        ports[0].postMessage(found);
      }
      else {
        // do it again
        reader.read().then(processChunk);
      }
    }
  };
</script>