[OOTB-hive] Tesseract OCR Consuming more than 100% CPU

heiko+orderofthebee.info at ecm4u.de heiko+orderofthebee.info at ecm4u.de
Tue Oct 13 11:24:10 BST 2020


Hi Anand,

To enable us to help you, please explain in more detail what exactly 
you have implemented and what exactly the problem is. Alfresco has no 
default feature for OCR.

Independent of your specific implementation, I always recommend not 
even saving a document in Alfresco if it still has to go through OCR. 
Instead, do your OCR processing _before_ you save that doc in Alfresco.

There are several reasons for this recommendation:
* Alfresco's transformation engine has no concept of replacing a node, 
which is what most OCR scenarios expect.
* As long as you don't use the EE Transform Service, any transformation 
will run in a local thread, consuming CPU. Unless you implement your own 
custom queuing mechanism for CE, this will not change and will not scale.
* Folder rules always run in a local thread and consume CPU unless they 
only create a job in a queue somewhere. Choosing "async" in a folder 
rule only means that it runs in an independent transaction.
* Storing a node first without OCR always means storing the same doc at 
least twice, which is neither wanted nor required in most cases.
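To illustrate the queuing point above, here is a minimal sketch in Python. It is not our production code and not an Alfresco API. The names `OcrQueue`, `run_tesseract`, `workers` and `max_pending` are made up for this example, and it assumes the `tesseract` binary is on the PATH. The caller (for example a folder rule) only enqueues a job and returns; a fixed number of workers does the OCR.

```python
import queue
import subprocess
import threading
from pathlib import Path


def run_tesseract(image: Path) -> Path:
    """Run the tesseract CLI on one image and return the searchable PDF path."""
    output_base = image.with_suffix("")  # tesseract appends ".pdf" itself
    subprocess.run(
        ["tesseract", str(image), str(output_base), "pdf"],
        check=True,
    )
    return Path(str(output_base) + ".pdf")


class OcrQueue:
    """Hypothetical bounded job queue: submit() never does the OCR itself."""

    def __init__(self, ocr=run_tesseract, workers=1, max_pending=100):
        self._ocr = ocr
        self._jobs = queue.Queue(maxsize=max_pending)
        self.results = {}  # image path -> OCR output path
        self.errors = {}   # image path -> exception raised by the OCR step
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, image: Path) -> None:
        # Returns immediately; raises queue.Full if the backlog is at max_pending.
        self._jobs.put_nowait(image)

    def _work(self) -> None:
        while True:
            image = self._jobs.get()
            try:
                self.results[image] = self._ocr(image)
            except Exception as exc:  # keep the worker alive after a failed job
                self.errors[image] = exc
            finally:
                self._jobs.task_done()

    def wait(self) -> None:
        """Block until every submitted job has been processed."""
        self._jobs.join()
```

With `workers=1` at most one OCR job runs at a time, no matter how many documents arrive at once. In a real setup the workers would live on a separate machine rather than in the same process.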

To make a long story short:
You can avoid a lot of hassle by not involving Alfresco in the OCR at 
all. For our customers we do all the processing and automation outside 
of Alfresco. To create renditions or replace the content property of a 
node, we implemented our own queuing mechanism, which delegates the 
transformation via REST to a professional transformation engine that 
also covers PDF handling (OCR, data extraction, PDF/A validation, 
optimization). But as I said, it's always better to avoid that by 
transforming docs before storing them inside Alfresco.
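As a rough sketch of the "OCR first, store afterwards" order, assuming Python and the `tesseract` command line tool: the functions `tesseract_command`, `ocr_to_pdf` and `ingest` are hypothetical names for this example, and `store` stands for whatever upload call you use (for instance a client for the Alfresco REST API), which is not shown here.

```python
import subprocess
import tempfile
from pathlib import Path


def tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> list:
    """Build the CLI call: tesseract <input> <outputbase> -l <lang> pdf."""
    # The trailing "pdf" config makes tesseract write <outputbase>.pdf
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def ocr_to_pdf(image: Path, work_dir: Path) -> Path:
    """Run OCR outside the repository and return the searchable PDF."""
    output_base = work_dir / image.stem
    subprocess.run(tesseract_command(image, output_base), check=True)
    return Path(str(output_base) + ".pdf")


def ingest(image: Path, ocr=ocr_to_pdf, store=None):
    """OCR the scan first, then hand only the finished PDF to `store`.

    `store` is a placeholder for your own upload function; it receives the
    path of the OCRed PDF and returns whatever identifies the stored node.
    """
    if store is None:
        raise ValueError("a store callable is required")
    with tempfile.TemporaryDirectory() as tmp:
        pdf = ocr(image, Path(tmp))
        return store(pdf)  # upload happens while the temp file still exists
```

This way the repository only ever sees the final document, so there is no second version of the same content and no transformation running inside Alfresco.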

Regards
Heiko

On 12.10.2020 at 20:33, Anand K wrote:
> Hi,
> 
> We recently delivered Alfresco Community 6.2 with Tesseract OCR to a 
> particular client on a CentOS-based server.
> 
> OCR was integrated successfully.
> 
> The client applied a business rule on a folder, and the files were 
> pushed to the folder through the API from different 3rd party applications.
> 
> Whenever OCR is applied, the CPU usage is high and Alfresco gets stuck 
> during this such that no other operations can be done on it.
> 
> Can you please help us resolve this issue? I read somewhere about 
> allocating a separate server for OCR alone. We don't have much 
> experience with the same. Can you please help us?
> -- 
> Thanks and Regards,
> Anand Kurian
> Chief Executive Officer
> Mob: +91 9496821365
>      +91 7012287156
> e-mail: anand.k at dieutek.com

-- 
Heiko Robert [heiko.robert at ecm4u.de]
Consultant / Managing Partner

ecm4u GmbH
http://www.ecm4u.de
Hölderlinplatz 2b
70193 Stuttgart

t: +49 (711) 912775-72
m: +49 (176) 347475-72
f: +49 (711) 912775-80

ecm4u GmbH - simply making sensible use of IT in processes
Commercial register: Amtsgericht Stuttgart HRB 734004, Managing Director: 
Heiko Robert
