[OOTB-hive] Tesseract OCR Consuming more than 100% CPU

Anand K anand.k at dieutek.com
Tue Oct 13 12:23:46 BST 2020


Dear Heiko,

Thank you very much for your reply. I appreciate the effort you put into
it.

We had arrived at the same conclusion that you describe in your mail.

To give you a clear picture of our situation, please see the points
below.

1. We integrated a Tesseract OCR-based add-on into Alfresco, which makes an
OCR option available among the document actions.

2. The customer uploads documents from various applications into Alfresco
through the API.

3. The documents are pushed into a folder to which an OCR business rule is
applied.

4. Whenever OCR starts running, CPU usage rises, and when many documents
are waiting for OCR, Alfresco becomes slow and unresponsive.

5. CPU usage returns to normal only when OCR stops. In production, many
documents that need OCR will keep coming in, and this makes their system
slow.

This is the exact scenario.

As you suggest, the OCR should be performed outside Alfresco, and we are
looking for such a solution.

We have gone through some tutorials and forums where we found similar
solutions.
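
For illustration, here is a minimal sketch of what such an external step
could look like: each file is run through the Tesseract command line on a
separate machine or process, with a bounded number of workers, and only the
OCR result is handed on for storage. This is only an assumption about how it
might be wired up, not what any add-on or Heiko's setup actually does. The
function names (tesseract_command, run_ocr, ocr_then_store) and the store
callback are hypothetical; the callback stands in for whatever uploads the
finished file to Alfresco. It assumes the tesseract binary is on the PATH and
that the inputs are images (TIFF, PNG, JPEG), since Tesseract does not read
PDF input.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def tesseract_command(src: Path, out_base: Path, lang: str = "eng") -> List[str]:
    """Build the Tesseract CLI call that writes <out_base>.pdf (searchable PDF)."""
    return ["tesseract", str(src), str(out_base), "-l", lang, "pdf"]


def run_ocr(src: Path, out_dir: Path) -> Path:
    """Run Tesseract on one image and return the path of the resulting PDF."""
    out_base = out_dir / src.stem
    # OMP_THREAD_LIMIT=1 asks Tesseract's OpenMP code to stay on one thread,
    # so a single document does not occupy several cores at once.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    subprocess.run(tesseract_command(src, out_base), check=True, env=env)
    return Path(str(out_base) + ".pdf")


def ocr_then_store(
    sources: Iterable[Path],
    ocr: Callable[[Path], Path],
    store: Callable[[Path], None],
    max_workers: int = 2,
) -> List[Path]:
    """OCR each source with at most max_workers jobs in parallel, then store it.

    `store` is a hypothetical callback, e.g. an upload to Alfresco; it is only
    called after OCR has finished, so the repository never sees the raw scan.
    """
    def process(src: Path) -> Path:
        result = ocr(src)
        store(result)
        return result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input sources.
        return list(pool.map(process, sources))
```

A call could then look like
ocr_then_store(files, lambda p: run_ocr(p, Path("out")), upload), where
upload is our own function. The max_workers value caps how many Tesseract
processes run at the same time, so the load stays predictable and stays off
the Alfresco server.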

Please let me know your thoughts on this.

Once again, thank you very much for your help.

Looking forward to hearing from you!

On Tue, Oct 13, 2020 at 3:38 PM Heiko Robert <
heiko+orderofthebee.info at ecm4u.de> wrote:

> Hi Anand,
>
> to enable us to help you, you have to explain in more detail what
> exactly you have implemented and what exactly the problem is. There is
> no default feature to do OCR with Alfresco.
>
> Independent of your specific implementation, I always recommend not to
> even save a document in Alfresco if it is to be used for OCR. Instead,
> do your OCR processing _before_ you save that doc in Alfresco.
>
> This recommendation has several reasons:
> * Alfresco's transformation engine has no concept of replacing a node,
> which is what most OCR scenarios expect.
> * As long as you don't use the EE Transform Service, any transformation
> will run in a local thread consuming CPU. Unless you implement your
> own custom queuing mechanism for CE, this will not change and will not
> scale.
> * Folder rules always run in a local thread and consume CPU
> unless they only create a job in a queue somewhere. Choosing "async" in
> a folder rule only means that it runs in an independent transaction.
> * Storing a node first without OCR always means storing the same doc at
> least twice, which is not wanted or required in most cases.
>
> To make a long story short:
> You could avoid a lot of hassle by not involving Alfresco at all in
> the OCR. For our customers we do all the processing and automation
> outside of Alfresco. To create renditions or replace the content
> property of a node, we implemented our own queuing mechanism that
> delegates the transformation via REST to a professional transformation
> engine, which includes PDF handling (OCR, data extraction, PDF/A
> validation, optimization). But as said, we try to run that
> transformation before we store the doc inside of Alfresco.
>
> Regards
> Heiko
>
>
> Am 12.10.2020 um 20:33 schrieb Anand K:
> > Hi,
> >
> > We recently delivered Alfresco Community 6.2 with Tesseract OCR to a
> > client on a CentOS-based server.
> >
> > OCR was integrated successfully.
> >
> > The client applied Business Rule on a folder and the files were pushed
> > to the folder through API from different 3rd party applications.
> >
> > Whenever OCR is applied, the CPU usage is high and Alfresco gets stuck
> > for the duration, so that no other operations can be done on it.
> >
> > Can you please help us resolve this issue? I read somewhere about
> > allocating a separate server for OCR alone. We don't have much
> > experience with the same. Can you please help us?
>
>

-- 


*Thanks and Regards,*
*Anand Kurian*
*Chief Executive Officer*
*Mob:  +91 9496821365*
*          +91 7012287156*
*e-mail:* anand.k at dieutek.com