SlideShare a Scribd company logo
1 of 9
Export from PDF to Excel
Overview and Steps
2
Problem with Converting PDFs to Excel
PDFs are usually one of the most readable formats for viewing data but converting them to Excel sheets
is a hard because:
1. We need a format with simple primitives and no structured information
2. There's no equivalent of a table component in PDF files as tables are created with straight lines and
coloured backgrounds
3. As tables in PDFs are drawn like images, detecting or extracting a table is a complex process
4. PDFs created by digital image or by scanning a printed file have distorted lines and no textual
elements
3
How does Exporting Scanned PDF to Excel
Work?
1. PDF to Word/Excel/Direct Text converters are used to copy the information
2. OCR (Optical Character Recognition) engine is used to read the PDF and then to copy its contents in
a different format like simple text
3. Additional programming like PDFMiner (Python-based) or TIka (Java-based) is required to process
the text into the required format or store them in tabular format
4. Code snippets written to push formatted data to Excel or configure online APIs if it’s Google Sheets
4
Methods to Detect Tables in Textual PDFs
Detecting Tables Using Stream: This technique is used to parse tables that have whitespaces between
cells to simulate a table structure. Basically, identifying the place where the text isn't present.
Detecting Tables Using Lattice: Compared to the stream technique, Lattice is more deterministic in
nature. It first parses through tables that have defined lines between cells. It can automatically parse
multiple tables present on a page.
5
Identifying Tables with Python and
Computer Vision
Computer Vision can help us find the borders, edges, and cells to identify tables.
1. The first step is to convert the PDF into images because CV algorithms are implemented on images
2. Inverse image thresholding and dilation technique can enhance the data in the given image to obtain
the image contours
3. Iterate over the contours list to plot the output using matplotlib
6
Identifying Tables with Deep Learning
Data Collection: Deep-learning based approaches are data-intensive and require large volumes of
training data for learning effective representations.
Data Preprocessing: This step is the most common thing for any machine learning or data-science
based problem. It mainly involves understanding the type of document we're working on.
Table Row-Column Annotations: After processing the documents, we'll have to generate annotations
for all the pages in the document. These annotations are basically masks for table and column.
Building a Model: The Model is the heart of the deep learning algorithm. It essentially involves designing
and implementing a neural network. Usually, for datasets containing scanned copies, Convolutional
Neural Networks are widely employed.
7
Business Benefits of Automating the PDF to
Excel Process
1. Reduces the time needed to search and copy/paste the required information manually
2. Reduces the probability of typos and other errors during manual extraction
3. By automating the PDFs to Excel conversion, we can easily integrate your data with any third-party
software
4. Business efficiency can be improved by automating the entire extraction pipeline and running it on a
batch of PDF files to get all desired information in one go
8
Existing Solutions that Convert PDFs to Excel
1. Nanonets
2. EasePDF
3. pdftoexcel
4. PDFZilla
5. Adobe Acrobat PDF to Excel
9
Learn more about
exporting from PDF
to Excel:
https://nanonets.com/blog/pdf-to-excel/

More Related Content

What's hot

optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition system
Vijay Apurva
 
Automatic handwriting recognition
Automatic handwriting recognitionAutomatic handwriting recognition
Automatic handwriting recognition
BIJIT GHOSH
 

What's hot (20)

Presentation on OCR
Presentation on OCRPresentation on OCR
Presentation on OCR
 
State-of-Art Optical Character Recognition case
State-of-Art Optical Character Recognition caseState-of-Art Optical Character Recognition case
State-of-Art Optical Character Recognition case
 
Design and implementation of optical character recognition using template mat...
Design and implementation of optical character recognition using template mat...Design and implementation of optical character recognition using template mat...
Design and implementation of optical character recognition using template mat...
 
Optical Character Recognition (OCR) System
Optical Character Recognition (OCR) SystemOptical Character Recognition (OCR) System
Optical Character Recognition (OCR) System
 
05a
05a05a
05a
 
Offline Omni Font Arabic Optical Text Recognition System using Prolog Classif...
Offline Omni Font Arabic Optical Text Recognition System using Prolog Classif...Offline Omni Font Arabic Optical Text Recognition System using Prolog Classif...
Offline Omni Font Arabic Optical Text Recognition System using Prolog Classif...
 
Optical Character Recognition (OCR) based Retrieval
Optical Character Recognition (OCR) based RetrievalOptical Character Recognition (OCR) based Retrieval
Optical Character Recognition (OCR) based Retrieval
 
Handwritten digit recognition using image processing
Handwritten digit recognition using image processing Handwritten digit recognition using image processing
Handwritten digit recognition using image processing
 
Handwriting Recognition
Handwriting RecognitionHandwriting Recognition
Handwriting Recognition
 
OCR speech using Labview
OCR speech using LabviewOCR speech using Labview
OCR speech using Labview
 
Optical Character Recognition
Optical Character RecognitionOptical Character Recognition
Optical Character Recognition
 
Final Report on Optical Character Recognition
Final Report on Optical Character Recognition Final Report on Optical Character Recognition
Final Report on Optical Character Recognition
 
OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition system
 
Handwritten Character Recognition
Handwritten Character RecognitionHandwritten Character Recognition
Handwritten Character Recognition
 
Optical character recognition IEEE Paper Study
Optical character recognition IEEE Paper StudyOptical character recognition IEEE Paper Study
Optical character recognition IEEE Paper Study
 
Project report of OCR Recognition
Project report of OCR RecognitionProject report of OCR Recognition
Project report of OCR Recognition
 
Automatic handwriting recognition
Automatic handwriting recognitionAutomatic handwriting recognition
Automatic handwriting recognition
 
An approach to empirical Optical Character recognition paradigm using Multi-L...
An approach to empirical Optical Character recognition paradigm using Multi-L...An approach to empirical Optical Character recognition paradigm using Multi-L...
An approach to empirical Optical Character recognition paradigm using Multi-L...
 
Text extraction From Digital image
Text extraction From Digital imageText extraction From Digital image
Text extraction From Digital image
 

Similar to PDF to Excel

Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
Carmen Sanborn
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith Kumar Pampatti
 
Excel Basic_Incomplete_Guide_2020
Excel Basic_Incomplete_Guide_2020 Excel Basic_Incomplete_Guide_2020
Excel Basic_Incomplete_Guide_2020
Malasy45
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
RAKESHG79
 

Similar to PDF to Excel (20)

IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
 
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
IRJET- Techniques for Detecting and Extracting Tabular Data from PDFs and Sca...
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptx
 
Extract and Analyze Data from PDF File and Web : A Review
Extract and Analyze Data from PDF File and Web : A ReviewExtract and Analyze Data from PDF File and Web : A Review
Extract and Analyze Data from PDF File and Web : A Review
 
“Semantic PDF Processing & Document Representation”
“Semantic PDF Processing & Document Representation”“Semantic PDF Processing & Document Representation”
“Semantic PDF Processing & Document Representation”
 
6. Applied Productivity Tools with Advanced Application Techniques PPT.pptx
6. Applied Productivity Tools with Advanced Application Techniques PPT.pptx6. Applied Productivity Tools with Advanced Application Techniques PPT.pptx
6. Applied Productivity Tools with Advanced Application Techniques PPT.pptx
 
Sulthan's DBMS for_Computer_Science
Sulthan's DBMS for_Computer_ScienceSulthan's DBMS for_Computer_Science
Sulthan's DBMS for_Computer_Science
 
The Missing Link: Metadata Conversion Workflows for Everyone
The Missing Link: Metadata Conversion Workflows for EveryoneThe Missing Link: Metadata Conversion Workflows for Everyone
The Missing Link: Metadata Conversion Workflows for Everyone
 
Abstract.DOCX
Abstract.DOCXAbstract.DOCX
Abstract.DOCX
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
 
ClassHandoutMFG321077LaurenAmes.pdf
ClassHandoutMFG321077LaurenAmes.pdfClassHandoutMFG321077LaurenAmes.pdf
ClassHandoutMFG321077LaurenAmes.pdf
 
EMPOWERMENT TECHNOLOGY LESSON 3.pptx
EMPOWERMENT TECHNOLOGY LESSON 3.pptxEMPOWERMENT TECHNOLOGY LESSON 3.pptx
EMPOWERMENT TECHNOLOGY LESSON 3.pptx
 
Excel Basic_Incomplete_Guide_2020
Excel Basic_Incomplete_Guide_2020 Excel Basic_Incomplete_Guide_2020
Excel Basic_Incomplete_Guide_2020
 
Public Training AS/400 for IT Support & Help Desk ( 14-18 Maret 2016 )
Public Training AS/400 for IT Support & Help Desk ( 14-18 Maret 2016 )Public Training AS/400 for IT Support & Help Desk ( 14-18 Maret 2016 )
Public Training AS/400 for IT Support & Help Desk ( 14-18 Maret 2016 )
 
Reengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software SpecificationsReengineering PDF-Based Documents Targeting Complex Software Specifications
Reengineering PDF-Based Documents Targeting Complex Software Specifications
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Data structure
Data structureData structure
Data structure
 
Data Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdfData Science & Big Data - Theory.pdf
Data Science & Big Data - Theory.pdf
 
Development of Information Extraction for Data Analysis using NLP
Development of Information Extraction for Data Analysis using NLPDevelopment of Information Extraction for Data Analysis using NLP
Development of Information Extraction for Data Analysis using NLP
 

Recently uploaded

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 

Recently uploaded (20)

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 

PDF to Excel

  • 1. Export from PDF to Excel Overview and Steps
  • 2. 2 Problem with Converting PDFs to Excel PDFs are usually one of the most readable formats for viewing data but converting them to Excel sheets is a hard because: 1. We need a format with simple primitives and no structured information 2. There's no equivalent of a table component in PDF files as tables are created with straight lines and coloured backgrounds 3. As tables in PDFs are drawn like images, detecting or extracting a table is a complex process 4. PDFs created by digital image or by scanning a printed file have distorted lines and no textual elements
  • 3. 3 How does Exporting Scanned PDF to Excel Work? 1. PDF to Word/Excel/Direct Text converters are used to copy the information 2. OCR (Optical Character Recognition) engine is used to read the PDF and then to copy its contents in a different format like simple text 3. Additional programming like PDFMiner (Python-based) or TIka (Java-based) is required to process the text into the required format or store them in tabular format 4. Code snippets written to push formatted data to Excel or configure online APIs if it’s Google Sheets
  • 4. 4 Methods to Detect Tables in Textual PDFs Detecting Tables Using Stream: This technique is used to parse tables that have whitespaces between cells to simulate a table structure. Basically, identifying the place where the text isn't present. Detecting Tables Using Lattice: Compared to the stream technique, Lattice is more deterministic in nature. It first parses through tables that have defined lines between cells. It can automatically parse multiple tables present on a page.
  • 5. 5 Identifying Tables with Python and Computer Vision Computer Vision can help us find the borders, edges, and cells to identify tables. 1. The first step is to convert the PDF into images because CV algorithms are implemented on images 2. Inverse image thresholding and dilation technique can enhance the data in the given image to obtain the image contours 3. Iterate over the contours list to plot the output using matplotlib
  • 6. 6 Identifying Tables with Deep Learning Data Collection: Deep-learning based approaches are data-intensive and require large volumes of training data for learning effective representations. Data Preprocessing: This step is the most common thing for any machine learning or data-science based problem. It mainly involves understanding the type of document we're working on. Table Row-Column Annotations: After processing the documents, we'll have to generate annotations for all the pages in the document. These annotations are basically masks for table and column. Building a Model: The Model is the heart of the deep learning algorithm. It essentially involves designing and implementing a neural network. Usually, for datasets containing scanned copies, Convolutional Neural Networks are widely employed.
  • 7. 7 Business Benefits of Automating the PDF to Excel Process 1. Reduces the time needed to search and copy/paste the required information manually 2. Reduces the probability of typos and other errors during manual extraction 3. By automating the PDFs to Excel conversion, we can easily integrate your data with any third-party software 4. Business efficiency can be improved by automating the entire extraction pipeline and running it on a batch of PDF files to get all desired information in one go
  • 8. 8 Existing Solutions that Convert PDFs to Excel 1. Nanonets 2. EasePDF 3. pdftoexcel 4. PDFZilla 5. Adobe Acrobat PDF to Excel
  • 9. 9 Learn more about exporting from PDF to Excel: https://nanonets.com/blog/pdf-to-excel/