
How we approach OCR for billions of PDF pages

We've explained a lot about what we want to do at Patdek. The foundation is to make all PTO documents findable/searchable. That's the baseline; that's the content. With PTO documents text-searchable, we layer on tools. That's the second part of the marriage of content and tools.

This post walks through what is involved in processing 3 billion pages of PDF documents. The short answer is that it could take more than 14 years if you used just a single 16-core server.

Let's think through this problem. Here's a quick primer on how to think about processing a large number of pages for just OCR.

Basically you need to know several values, understanding that this is merely an approximation.

  • N - number of pages to be processed
  • T - time in seconds to process one page on one CPU
  • CPU - number of server CPUs
  • R - total time in seconds to process N pages
  • Rh - total time in hours to process N pages

From the primer linked above, we'll use that site's wishful estimate of 2.4 seconds/page processing time for a medium complexity project. They don't explain any aspect of image pre-processing or error thresholds achieved. In practice, 2.4 seconds likely underestimates the time to process a single page.

Assuming a single core, the processing time to OCR 500,000 pages would be:

R = T (seconds per page on 1 CPU) * N (number of pages) / CPU (number of server CPUs)

Rh = R/3600 (convert from seconds to hours)

R = 2.4 * 500,000 / 1 = 1,200,000 seconds

Rh = 1,200,000 / 3600 = 333 hours (nearly 2 weeks)

Now let's assume you have a dedicated server with 16 CPU cores for OCR. Total time to OCR 500,000 pages is 333 / 16 = 20.8 hours, which is less than a day.

So how long does it take to OCR 3,000,000,000 pages on that same 16-core server? 125,000 hours, or 5,208 days, or 14.3 years. Assuming you could split this work up among 100 such servers, you're still looking at 1,250 hours or 52 days; basically 2 months.
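The arithmetic above can be sketched in a few lines of Python. This is only a rough sketch of the formula R = T * N / CPU: the function name `ocr_hours` and the `servers` parameter are made up for illustration, and it assumes the work divides perfectly across cores and servers with no overhead.

```python
def ocr_hours(pages, seconds_per_page=2.4, cpus=1, servers=1):
    """Estimated hours to OCR `pages`, assuming perfect parallel scaling."""
    # R = T * N / CPU, with CPU counted across every server in the fleet
    total_seconds = seconds_per_page * pages / (cpus * servers)
    # Rh = R / 3600
    return total_seconds / 3600


scenarios = {
    "500,000 pages, 1 core": ocr_hours(500_000),
    "500,000 pages, 16 cores": ocr_hours(500_000, cpus=16),
    "3 billion pages, 16 cores": ocr_hours(3_000_000_000, cpus=16),
    "3 billion pages, 100 servers x 16 cores": ocr_hours(
        3_000_000_000, cpus=16, servers=100
    ),
}

for label, hours in scenarios.items():
    days = hours / 24
    print(f"{label}: {hours:,.1f} hours ({days:,.1f} days, {days / 365:.1f} years)")
```

Changing `seconds_per_page` is the quickest way to see how sensitive the whole estimate is to the per-page assumption.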

The problem is that processing time varies widely with page complexity and with the pre-processing steps a given page needs. Typical processing time is 2 to 3 times the original assumption of 2.4 seconds/page. At 5 to 7 seconds/page, processing time increases greatly, along with processing cost. And that's just to OCR the 3 billion pages, without doing anything else.
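To get a feel for what the slower per-page times mean, here is a small sketch that reruns the same estimate at 2.4, 5, and 7 seconds/page. The 100-server, 16-core fleet is the hypothetical one from the example above, the name `days_to_ocr` is invented for illustration, and perfect scaling is again assumed.

```python
PAGES = 3_000_000_000
CORES = 16 * 100  # hypothetical fleet: 100 servers with 16 cores each
SECONDS_PER_DAY = 86_400


def days_to_ocr(seconds_per_page, pages=PAGES, cores=CORES):
    """Estimated days to OCR `pages`, assuming work splits evenly across cores."""
    return seconds_per_page * pages / cores / SECONDS_PER_DAY


for t in (2.4, 5.0, 7.0):
    print(f"{t:.1f} s/page -> {days_to_ocr(t):,.0f} days")
```

Under these assumptions the 52-day run at 2.4 seconds/page stretches to roughly 109 days at 5 seconds/page and about 152 days at 7 seconds/page.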