autoKYC

</div>

⚡ AutoKYC ⚡

Named entity extraction from financial documents with OpenCV, pytesseract and spaCy (OCR + NER)

</hr>

<img src="https://user-images.githubusercontent.com/26667491/221348114-87651e73-19f7-4d65-9080-0cfda72e0246.gif" height='300' width='600'>

Economic Impact

<img src="https://github.com/user-attachments/assets/b588be59-0b2c-45a2-919f-af4ed4cef75d" height='300' width='600'>

Manual KYC checks

<img src="https://github.com/user-attachments/assets/12e3ca6b-4f5f-40c3-af0e-caecd82634f4" height='300' width='600'>

Automated KYC checks

<img src="https://github.com/user-attachments/assets/5629d9fe-0134-4ffc-8721-0fa1db8bd403" height='300' width='600'>

The global identity verification market is expected to grow from $9.5 billion in 2022 to $18.6 billion by 2027, so the trend toward automation is becoming ever more apparent.

`Development Stages`

`Training Architecture (NER Model)`

`Architecture`

`Text Detection Workflow`

Image Preprocessing (suppressing unwanted distortions and enhancing important image features):

* Binarization
* Rescaling
* Dilation
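
To make the three preprocessing steps concrete, here is a library-free sketch on a tiny grayscale grid. The function names, the fixed threshold of 128 and the 3x3 neighbourhood are illustrative assumptions, and the sketch treats bright pixels as foreground (invert first for dark text on a light page). In a real pipeline OpenCV's `cv2.threshold`, `cv2.resize` and `cv2.dilate` would do this work on actual images.

```python
def binarize(gray, threshold=128):
    """Map every pixel to 0 (background) or 255 (foreground) by a fixed threshold."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def rescale(image, factor):
    """Upscale by an integer factor using nearest-neighbour repetition."""
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def dilate(binary):
    """Grow foreground: a pixel turns 255 if any pixel in its 3x3 neighbourhood is 255."""
    height, width = len(binary), len(binary[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 255:
                        out[y][x] = 255
    return out


# Hypothetical 2x2 grayscale patch, just to exercise the three steps in order.
gray = [[10, 200], [90, 30]]
binary = binarize(gray)
enlarged = rescale(binary, 2)
thickened = dilate(enlarged)
```

Dilation thickens thin strokes, which can help OCR on faint print but may also merge characters that sit close together, so the neighbourhood size is worth tuning per document type.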

Image Segmentation (splitting the image by):

* Single character
* Word
* Line
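
As a rough illustration of going from word-level to line-level segmentation, the sketch below groups word boxes whose vertical centres are close together. The dictionary keys mirror the `left`, `top`, `width`, `height` and `text` fields that pytesseract's `image_to_data` reports, but the function name, the sample boxes and the 10-pixel tolerance are assumptions made for this example.

```python
def group_words_into_lines(words, tolerance=10):
    """Group word boxes into text lines by comparing vertical centres."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        center = word["top"] + word["height"] / 2
        for line in lines:
            # Same line if the centre is within `tolerance` pixels of the line's first word.
            if abs(line["center"] - center) <= tolerance:
                line["words"].append(word)
                break
        else:
            lines.append({"center": center, "words": [word]})
    # Read each line left to right.
    return [
        " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["left"]))
        for line in lines
    ]


# Hypothetical word boxes from a scanned business card.
sample_words = [
    {"text": "Doe", "left": 60, "top": 12, "width": 30, "height": 10},
    {"text": "Jane", "left": 10, "top": 10, "width": 40, "height": 10},
    {"text": "jane@example.com", "left": 10, "top": 40, "width": 120, "height": 10},
]
card_lines = group_words_into_lines(sample_words)
```

Tesseract also reports its own line numbers, so a grouping like this is mainly useful when boxes come from another detector or when the built-in layout analysis splits lines badly.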

`Labeling - BIO/IOB Tagging Format`

* The most time-consuming task: it took more than 10 hours a day for several weeks
* Lesson learned: collecting good data in real life is not a cakewalk
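
For readers new to the BIO/IOB format, here is a minimal sketch of how character-span annotations turn into per-token tags: `B-` marks the first token of an entity, `I-` a continuation, and `O` everything else. The label names (`NAME`, `ORG`, `EMAIL`), the whitespace tokenisation and the helper name are assumptions for illustration, not the labelling tool actually used.

```python
def bio_tags(text, spans):
    """Tag whitespace tokens with BIO labels.

    spans: list of (start, end, label) character offsets, end exclusive.
    """
    tagged = []
    cursor = 0
    for token in text.split():
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        tag = "O"
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # First token of the span gets B-, later tokens get I-.
                tag = ("B-" if start == span_start else "I-") + label
                break
        tagged.append((token, tag))
    return tagged


# Hypothetical OCR text from a business card with hand-labelled spans.
card_text = "Jane Doe Acme Corp jane@acme.com"
card_spans = [(0, 8, "NAME"), (9, 18, "ORG"), (19, 32, "EMAIL")]
card_tags = bio_tags(card_text, card_spans)
```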

## `Bounding Boxes`

## `Input - Real Time`

**Scanned results for a very common, easy input of the kind you get in real time. Real input can be anything from crazy to very crazy.**

## `NER Prediction`

**NER predictions on the scanned results of the business card above.**

**Finding the organisation and name is still a bit difficult. `Clearly I have to increase the business card data from 3000+ cards to maybe 10000+, and in parallel refine my approach a bit more.`**
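
To connect predictions back to locations on the card, BIO-tagged tokens can be merged into whole entities while their word boxes are unioned into one bounding box. The sketch below is one possible way to do that; the token dictionary layout, the tag names and the function name are assumptions for this example rather than the project's actual data structures.

```python
def merge_entities(tokens):
    """Merge BIO-tagged tokens into entities with a combined bounding box.

    Each token is a dict with text, tag, left, top, width and height.
    Boxes are returned as (x0, y0, x1, y1).
    """
    entities = []
    current = None
    for tok in tokens:
        if tok["tag"] == "O":
            current = None
            continue
        prefix, label = tok["tag"].split("-", 1)
        x0, y0 = tok["left"], tok["top"]
        x1, y1 = x0 + tok["width"], y0 + tok["height"]
        if prefix == "I" and current is not None and current["label"] == label:
            # Continuation: extend the text and grow the box to cover this token.
            current["text"] += " " + tok["text"]
            bx0, by0, bx1, by1 = current["box"]
            current["box"] = (min(bx0, x0), min(by0, y0), max(bx1, x1), max(by1, y1))
        else:
            current = {"label": label, "text": tok["text"], "box": (x0, y0, x1, y1)}
            entities.append(current)
    return entities


# Hypothetical tagged tokens for one business card.
tagged_tokens = [
    {"text": "Jane", "tag": "B-NAME", "left": 10, "top": 10, "width": 40, "height": 12},
    {"text": "Doe", "tag": "I-NAME", "left": 55, "top": 11, "width": 30, "height": 12},
    {"text": "CEO", "tag": "O", "left": 10, "top": 25, "width": 30, "height": 12},
    {"text": "jane@acme.com", "tag": "B-EMAIL", "left": 10, "top": 40, "width": 120, "height": 12},
]
card_entities = merge_entities(tagged_tokens)
```

The merged boxes are what would be drawn on the image, and the merged text is what would be reported for each entity.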

### `Problem Statement`

Develop a customized Named Entity Recognizer to extract entities from scanned document images such as:

1. Invoice
2. Business Card [my focus]: extract entities like Name, Phone, Email, Organisation and Website link
3. Shipping Bill, etc.

### `Technologies used`

1. Computer Vision modules were used to:
   1. scan the document
   2. identify the location of text
   3. extract text from the image
2. Natural Language Processing was used for:
   1. extracting entities from text
   2. text cleaning
   3. parsing entities from text

### `Python Libraries used in Computer Vision Module`

### `Python Libraries used in Natural Language Processing`
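
As a rough illustration of the text cleaning and parsing steps, the sketch below uses only Python's standard `re` module. It strips stray edge characters that OCR tends to leave behind and then pulls out pattern-like fields (email, phone, website) with regular expressions. The noise characters, the patterns and the function names are illustrative assumptions, not the project's actual code, and entities such as Name and Organisation still need the trained NER model.

```python
import re

# Characters OCR often leaves on token edges (an assumed, non-exhaustive set).
EDGE_NOISE = "|,;:\"'`~"

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def clean_text(text):
    """Collapse whitespace and strip noisy characters from token edges."""
    tokens = (tok.strip(EDGE_NOISE) for tok in text.split())
    return " ".join(tok for tok in tokens if tok)


def parse_contact_fields(text):
    """Return pattern-based candidates for email, phone and website."""
    cleaned = clean_text(text)
    return {
        "email": EMAIL_RE.findall(cleaned),
        "phone": PHONE_RE.findall(cleaned),
        "website": URL_RE.findall(cleaned),
    }


# Hypothetical raw OCR output from a business card.
raw_ocr = "Jane Doe | Acme Corp,\nPhone: +1 (555) 010-9999\nemail: jane.doe@acme.com\nwww.acme.com"
fields = parse_contact_fields(raw_ocr)
```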

### `Flow to Extract Entities`

1. Location of Entity
2. Text of Corresponding Entity

### `Some more NER use-cases`

</p>

### `Improvements:`

1. I am using a spaCy NER model, which is a `BERT architecture`, so I have to provide more data to this model to see a performance improvement
2. I can also improve the `Data Preparation Framework`
3. I am using PyTesseract (Google's Tesseract OCR engine) to extract text, which has some limitations:
   1. Image resolution must be at least `200 dpi`, or width and height must be at least `300 pixels`
   2. Text must not be rotated or skewed
   3. Text must not have effects applied to it
   4. Text must not be blurred
   5. Text must not be cursive handwriting

### `References`

* [Skew Detection and Correction of Document images using Hough Transform](https://muthu.co/skew-detection-and-correction-of-document-images-using-hough-transform/)
* [Skew Detection and Correction of Devanagari Script Using Hough Transform](https://www.researchgate.net/publication/274142211_Skew_Detection_and_Correction_of_Devanagari_Script_Using_Hough_Transform)

### `What Next`

* [Understanding this](https://dropbox.tech/machine-learning/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning)

### `Trouble: Choosing vendors for automated KYC verification`

<img src="https://github.com/user-attachments/assets/5debc3e9-1686-4119-b99f-4d0cb5246b35" height='300' width='600'>