</div>
⚡ AutoKYC ⚡
Named entity extraction from financial documents with OpenCV, Pytesseract, and spaCy (OCR + NER)
</hr>
<img src=https://img.shields.io/badge/Built%20using-Python-yellow>
<img src="https://user-images.githubusercontent.com/26667491/221348114-87651e73-19f7-4d65-9080-0cfda72e0246.gif" height='300' width='600'>
Economic Impact
<img src="https://github.com/user-attachments/assets/b588be59-0b2c-45a2-919f-af4ed4cef75d" height='300' width='600'>
Manual KYC checks
<img src="https://github.com/user-attachments/assets/12e3ca6b-4f5f-40c3-af0e-caecd82634f4" height='300' width='600'>
Automated KYC checks
<img src="https://github.com/user-attachments/assets/5629d9fe-0134-4ffc-8721-0fa1db8bd403" height='300' width='600'>
The global identity verification market is expected to grow from $9.5 billion in 2022 to $18.6 billion by 2027, nearly doubling in five years, so the trend toward automation is increasingly clear.
Development Stages
Training Architecture (NER Model)
Architecture
Text Detection Workflow
Labeling - BIO/IOB Tagging Format
This was the most time-consuming task: it took more than 10 hours per day for several weeks.
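
To make the BIO/IOB format concrete, here is a minimal sketch of how labeled spans become per-token tags. The tokens, the span format `(start, end_exclusive, label)`, and the labels `NAME`, `ORG`, and `EMAIL` are assumptions for illustration, not necessarily the label set used in this project.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn token-index spans (start, end_exclusive, label) into BIO tags."""
    tags = ["O"] * len(tokens)  # every token starts as "Outside"
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the same entity
    return tags


# Hypothetical tokens from a scanned business card.
tokens = ["Jane", "Doe", "Acme", "Corp", "jane@acme.example"]
spans = [(0, 2, "NAME"), (2, 4, "ORG"), (4, 5, "EMAIL")]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token gets exactly one tag, so a multi-word entity such as a full name is marked as a `B-` token followed by `I-` tokens.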
Learning
* Collecting good data in real life is not a cakewalk
## `Bounding Boxes`
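
As a rough sketch of how word-level bounding boxes can be pulled out of OCR output: the function below assumes a dictionary shaped like the result of `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, with `text`, `conf`, `left`, `top`, `width`, and `height` lists. The `word_boxes` helper, the confidence threshold of 60, and the sample values are assumptions for illustration, not this project's actual code.

```python
from typing import Dict, List, Tuple

Box = Tuple[str, int, int, int, int]  # (text, left, top, right, bottom)


def word_boxes(data: Dict[str, list], min_conf: float = 60.0) -> List[Box]:
    """Keep non-empty words whose OCR confidence is at least min_conf."""
    boxes: List[Box] = []
    for i, text in enumerate(data["text"]):
        word = text.strip()
        conf = float(data["conf"][i])  # non-word rows carry a confidence of -1
        if not word or conf < min_conf:
            continue
        left, top = data["left"][i], data["top"][i]
        right = left + data["width"][i]
        bottom = top + data["height"][i]
        boxes.append((word, left, top, right, bottom))
    return boxes


# Hand-made sample in the assumed shape; real values come from the OCR engine.
sample = {
    "text": ["", "Jane", "Doe", "smudge"],
    "conf": ["-1", "96", "91", "12"],
    "left": [0, 40, 120, 300],
    "top": [0, 30, 30, 200],
    "width": [600, 70, 60, 50],
    "height": [300, 24, 24, 20],
}
boxes = word_boxes(sample)
print(boxes)
```

The resulting corner coordinates can be passed to a drawing call such as `cv2.rectangle` to overlay the boxes on the scanned image.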
## `Input - Real Time`

**Eyeballing the scanned results for a very common and easy input you might get in real time. Real inputs can range anywhere from crazy to very crazy.**
## `NER Prediction`
**These are the NER predictions on the scanned results of the business card above.**
**Finding the organisation and the name is still a bit difficult. `Clearly I have to increase the business card data from 3000+ cards to maybe 10000+, and in parallel I need to refine my approach a bit more.`**
</p>
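
To show how per-token predictions turn into readable fields, here is a minimal sketch that groups BIO-tagged tokens into entities. The `decode_bio` helper, the sample tokens, and the `NAME` and `ORG` labels are assumptions for illustration; the project's actual model output and label set may differ.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entities."""
    entities: List[Tuple[str, str]] = []
    current_label = None
    current_words: List[str] = []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts here (a stray I- tag is treated as a start).
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = label, [token]
        elif prefix == "I":
            current_words.append(token)  # continue the open entity
        else:
            # "O" closes any open entity.
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []

    if current_words:
        entities.append((current_label, " ".join(current_words)))
    return entities


# Hypothetical predictions for a scanned business card.
tokens = ["Jane", "Doe", "Senior", "Engineer", "Acme", "Corp"]
tags = ["B-NAME", "I-NAME", "O", "O", "B-ORG", "I-ORG"]
entities = decode_bio(tokens, tags)
print(entities)
```

When a name or organisation is split or missed, it shows up here as a truncated or absent entity, which makes this kind of grouped output handy for spot-checking predictions.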
### `Improvements:`
1. I am using a spaCy NER model, which is a `BERT architecture`, so I have to provide more data to this model to see a performance improvement
2. I can also improve the `Data Preparation Framework`
3. I am using PyTesseract (a Python wrapper for Google's Tesseract OCR engine) to extract text, and it has some limitations:
   1. Image resolution must be at least `200 dpi`, or width and height must be at least `300 pixels`
   2. Text must not be rotated or skewed
   3. Text must not have effects applied to it
   4. Text must not be blurred
   5. Text must not be cursive handwriting
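
The size limitation above can be checked before an image is sent to OCR. This is a minimal sketch under the thresholds stated in the list (200 dpi, or at least 300 pixels in both width and height); the `meets_ocr_minimums` helper and its parameters are hypothetical names, not part of PyTesseract.

```python
from typing import Optional

MIN_DPI = 200      # threshold taken from the limitation list above
MIN_SIDE_PX = 300  # minimum width and height in pixels


def meets_ocr_minimums(width_px: int, height_px: int, dpi: Optional[float] = None) -> bool:
    """Return True if the image satisfies either the dpi or the pixel-size rule."""
    if dpi is not None and dpi >= MIN_DPI:
        return True
    return width_px >= MIN_SIDE_PX and height_px >= MIN_SIDE_PX


print(meets_ocr_minimums(1000, 600))         # large enough in pixels
print(meets_ocr_minimums(250, 250, dpi=72))  # too small and too low in dpi
```

Images that fail the check could be rescanned or upscaled before OCR instead of producing unreliable text.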
### `References`
* [Skew Detection and Correction of Document images using Hough Transform](https://muthu.co/skew-detection-and-correction-of-document-images-using-hough-transform/)
* [Skew Detection and Correction of Devanagari Script Using Hough Transform](https://www.researchgate.net/publication/274142211_Skew_Detection_and_Correction_of_Devanagari_Script_Using_Hough_Transform)
### `What Next`
* [Understanding this](https://dropbox.tech/machine-learning/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning)
<img src="https://github.com/user-attachments/assets/5debc3e9-1686-4119-b99f-4d0cb5246b35" height='300' width='600'>
### `Trouble: Choosing vendors for automated KYC verification`