KevinHuSh committed · Commit 6943c52 · 1 Parent(s): 41c7a59

refine README (#72)
* refine README
* Update README.md
- deepdoc/README.md +3 -11
    	
        deepdoc/README.md
    CHANGED
    
| Old line | Change | Content |
|---|---|---|
| `@@ -1,8 +1,6 @@` | | |
| 1 | | `English \| [简体中文](./README_zh.md)` |
| 2 | | |
| 3 | - | |
| 4 | - | |
| 5 | - | `---` |
| 6 | | |
| 7 | | `- [1. Introduction](#1)` |
| 8 | | `- [2. Vision](#2)` |
| `@@ -11,7 +9,6 @@ English \| [简体中文](./README_zh.md)` | | |
| 11 | | `<a name="1"></a>` |
| 12 | | `## 1. Introduction` |
| 13 | | |
| 14 | - | `---` |
| 15 | | `With a bunch of documents from various domains with various formats and along with diverse retrieval requirements,` |
| 16 | | `an accurate analysis becomes a very challenge task. *Deep*Doc is born for that purpose.` |
| 17 | | `There 2 parts in *Deep*Doc so far: vision and parser.` |
| `@@ -19,8 +16,6 @@ There 2 parts in *Deep*Doc so far: vision and parser.` | | |
| 19 | | `<a name="2"></a>` |
| 20 | | `## 2. Vision` |
| 21 | | |
| 22 | - | `---` |
| 23 | - | |
| 24 | | `We use vision information to resolve problems as human being.` |
| 25 | | `  - OCR. Since a lot of documents presented as images or at least be able to transform to image,` |
| 26 | | `    OCR is a very essential and fundamental or even universal solution for text extraction.` |
| `@@ -64,19 +59,16 @@ We use vision information to resolve problems as human being.` | | |
| 64 | | `<a name="3"></a>` |
| 65 | | `## 3. Parser` |
| 66 | | |
| 67 | - | `---` |
| 68 | - | |
| 69 | | `Four kinds of document formats as PDF, DOCX, EXCEL and PPT have their corresponding parser.` |
| 70 | | `The most complex one is PDF parser since PDF's flexibility. The output of PDF parser includes:` |
| 71 | | `  - Text chunks with their own positions in PDF(page number and rectangular positions).` |
| 72 | | `  - Tables with cropped image from the PDF, and contents which has already translated into natural language sentences.` |
| 73 | | `  - Figures with caption and text in the figures.` |
| 74 | | |
| 75 | - | `###Résumé` |
| 76 | | |
| 77 | - | `---` |
| 78 | | `The résumé is a very complicated kind of document. A résumé which is composed of unstructured text` |
| 79 | | `with various layouts could be resolved into structured data composed of nearly a hundred of fields.` |
| 80 | | `We haven't opened the parser yet, as we open the processing method after parsing procedure.` |
| 81 | | |
| 82 | - | |

| New line | Change | Content |
|---|---|---|
| `@@ -1,8 +1,6 @@` | | |
| 1 | | `English \| [简体中文](./README_zh.md)` |
| 2 | | |
| 3 | + | `# *Deep*Doc` |
| 4 | | |
| 5 | | `- [1. Introduction](#1)` |
| 6 | | `- [2. Vision](#2)` |
| `@@ -11,7 +9,6 @@ English \| [简体中文](./README_zh.md)` | | |
| 9 | | `<a name="1"></a>` |
| 10 | | `## 1. Introduction` |
| 11 | | |
| 12 | | `With a bunch of documents from various domains with various formats and along with diverse retrieval requirements,` |
| 13 | | `an accurate analysis becomes a very challenge task. *Deep*Doc is born for that purpose.` |
| 14 | | `There 2 parts in *Deep*Doc so far: vision and parser.` |
| `@@ -19,8 +16,6 @@ There 2 parts in *Deep*Doc so far: vision and parser.` | | |
| 16 | | `<a name="2"></a>` |
| 17 | | `## 2. Vision` |
| 18 | | |
| 19 | | `We use vision information to resolve problems as human being.` |
| 20 | | `  - OCR. Since a lot of documents presented as images or at least be able to transform to image,` |
| 21 | | `    OCR is a very essential and fundamental or even universal solution for text extraction.` |
| `@@ -64,19 +59,16 @@ We use vision information to resolve problems as human being.` | | |
| 59 | | `<a name="3"></a>` |
| 60 | | `## 3. Parser` |
| 61 | | |
| 62 | | `Four kinds of document formats as PDF, DOCX, EXCEL and PPT have their corresponding parser.` |
| 63 | | `The most complex one is PDF parser since PDF's flexibility. The output of PDF parser includes:` |
| 64 | | `  - Text chunks with their own positions in PDF(page number and rectangular positions).` |
| 65 | | `  - Tables with cropped image from the PDF, and contents which has already translated into natural language sentences.` |
| 66 | | `  - Figures with caption and text in the figures.` |
| 67 | | |
| 68 | + | `### Résumé` |
| 69 | | |
| 70 | | `The résumé is a very complicated kind of document. A résumé which is composed of unstructured text` |
| 71 | | `with various layouts could be resolved into structured data composed of nearly a hundred of fields.` |
| 72 | | `We haven't opened the parser yet, as we open the processing method after parsing procedure.` |
| 73 | | |
| 74 | + | |