Export from PDF to Excel
Overview and Steps
Problem with Converting PDFs to Excel
PDFs are among the most readable formats for viewing data, but converting them to Excel sheets
is hard because:
1. PDF is a format built from simple primitives, with no structured information
2. PDF files have no equivalent of a table component; tables are drawn with straight lines and
coloured backgrounds
3. Because tables in PDFs are drawn like images, detecting or extracting a table is a complex process
4. PDFs created from a digital image or by scanning a printed file have distorted lines and no textual
elements
How Does Exporting a Scanned PDF to Excel Work?
1. PDF-to-Word, PDF-to-Excel, or direct-text converters are used to copy the information
2. An OCR (Optical Character Recognition) engine reads the PDF and copies its contents into
a different format, such as plain text
3. Additional programming with tools like PDFMiner (Python-based) or Tika (Java-based) is required
to process the text into the required format or store it in tabular form
4. Code snippets push the formatted data to Excel, or online APIs are configured if the target is Google Sheets
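Steps 3 and 4 above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes the OCR engine has already produced plain text in which cells are separated by two or more spaces, and it writes CSV (which Excel opens directly) rather than a native .xlsx file. The function names and the sample text are hypothetical.

```python
import csv
import io
import re

def text_to_rows(ocr_text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows

def rows_to_csv(rows):
    """Serialise rows as CSV text, a format Excel opens directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

# Hypothetical OCR output for a three-column table (assumption, not real data).
sample_text = "Item      Qty   Price\nWidget    2     9.50\n\nGadget    10    3.25\n"
rows = text_to_rows(sample_text)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing a real .xlsx workbook or pushing to Google Sheets would need a third-party library or an online API, as the slide notes.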
Methods to Detect Tables in Textual PDFs
Detecting Tables Using Stream: This technique parses tables that use whitespace between
cells to simulate a table structure. In essence, it identifies the places where no text is present.
Detecting Tables Using Lattice: Compared to the stream technique, Lattice is more deterministic in
nature. It parses tables that have defined lines between cells, and it can automatically parse
multiple tables present on a page.
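The idea behind the stream technique can be sketched without any PDF library. The example below is a simplified illustration under stated assumptions: the page text is already available as fixed-width lines, and a column boundary is any character position that is blank on every line. The helper names and sample rows are hypothetical, and real tools work with text coordinates rather than character offsets.

```python
def find_column_gaps(lines):
    """Return (start, end) character ranges that are blank on every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    gaps, start = [], None
    for i, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, width))
    return gaps

def split_by_gaps(lines, gaps):
    """Cut every line at the detected gaps and strip each cell."""
    width = max(len(line) for line in lines)
    # Text columns are the regions between consecutive gaps.
    edges = [0] + [x for gap in gaps for x in gap] + [width]
    spans = [(edges[i], edges[i + 1]) for i in range(0, len(edges), 2)]
    spans = [(a, b) for a, b in spans if b > a]
    return [[line.ljust(width)[a:b].strip() for a, b in spans] for line in lines]

def make_line(a, b, c):
    """Build a fixed-width sample line (hypothetical layout: 12 + 10 chars)."""
    return a.ljust(12) + b.ljust(10) + c

lines = [
    make_line("Name", "City", "Age"),
    make_line("Ada Byron", "London", "36"),
    make_line("Alan T", "Wilmslow", "41"),
]
gaps = find_column_gaps(lines)
table = split_by_gaps(lines, gaps)
print(table)
```

Because a gap must be blank on every line, the single space inside "Ada Byron" does not split the cell, which is the main advantage over splitting each line on whitespace independently.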
Identifying Tables with Python and Computer Vision
Computer vision can help us find the borders, edges, and cells that identify tables.
1. The first step is to convert the PDF into images, because CV algorithms operate on images
2. Inverse image thresholding and dilation enhance the data in the image so that its
contours can be obtained
3. Iterate over the list of contours and plot the output using matplotlib
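To show what inverse thresholding and dilation actually do, here is a dependency-free toy version working on a tiny grayscale grid. It is a sketch under assumptions, not the pipeline from the slides: in practice an image library such as OpenCV would perform these steps on the rendered page, and bounding boxes of connected regions stand in here for contours. All function names and the sample image are hypothetical, and plotting is omitted.

```python
def inverse_threshold(gray, cutoff=128):
    """Dark pixels (ink) become 1, light pixels (paper) become 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]

def dilate(binary):
    """3x3 dilation: a pixel turns on if it or any of its 8 neighbours is on."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ))
    return out

def bounding_boxes(binary):
    """Bounding box (x0, y0, x1, y1) of each 4-connected foreground region."""
    h, w = len(binary), len(binary[0])
    seen, boxes = set(), []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or (y, x) in seen:
                continue
            stack = [(y, x)]
            seen.add((y, x))
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cy, cx = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes

# Hypothetical 5x9 scan: white page with a horizontal rule broken at x = 4.
gray = [[255] * 9 for _ in range(5)]
for x in (1, 2, 3, 5, 6, 7):
    gray[2][x] = 0

binary = inverse_threshold(gray)
before = bounding_boxes(binary)
after = bounding_boxes(dilate(binary))
print(before, after)
```

The broken rule is seen as two separate regions before dilation and as a single region afterwards, which is why dilation helps when scanned table lines are faint or interrupted.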
Identifying Tables with Deep Learning
Data Collection: Deep-learning-based approaches are data-intensive and require large volumes of
training data to learn effective representations.
Data Preprocessing: This step is common to any machine learning or data science
problem. It mainly involves understanding the type of document we're working on.
Table Row-Column Annotations: After processing the documents, we have to generate annotations
for all the pages in the document. These annotations are essentially masks for the table and its columns.
Building a Model: The model is the heart of the deep learning algorithm. Building it involves designing
and implementing a neural network. For datasets containing scanned copies, Convolutional
Neural Networks are widely employed.
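The annotation step can be made concrete with a small sketch. It assumes, purely for illustration, that each page has been labelled with pixel bounding boxes for the table and for each column, and it rasterises those boxes into binary masks of the kind a segmentation network would be trained against. The page size, box coordinates, and function name are hypothetical.

```python
def box_mask(height, width, boxes):
    """Binary mask with 1 inside any (x0, y0, x1, y1) box, end-exclusive."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask

# Hypothetical annotations for a 6x8 pixel page (assumed values).
page_h, page_w = 6, 8
table_boxes = [(1, 1, 7, 5)]
column_boxes = [(1, 1, 3, 5), (5, 1, 7, 5)]

table_mask = box_mask(page_h, page_w, table_boxes)
column_mask = box_mask(page_h, page_w, column_boxes)

table_area = sum(map(sum, table_mask))
column_area = sum(map(sum, column_mask))
# Every column pixel should also lie inside the table mask.
columns_inside_table = all(
    table_mask[y][x] >= column_mask[y][x]
    for y in range(page_h)
    for x in range(page_w)
)
print(table_area, column_area, columns_inside_table)
```

A check like the last one is a cheap way to catch inconsistent labels before training starts.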
Business Benefits of Automating the PDF to Excel Process
1. Reduces the time needed to search for and copy/paste the required information manually
2. Reduces the probability of typos and other errors that occur during manual extraction
3. Automating the PDF to Excel conversion makes it easy to integrate the data with third-party
software
4. Business efficiency improves when the entire extraction pipeline is automated and run on a
batch of PDF files to get all desired information in one go
Existing Solutions that Convert PDFs to Excel
1. Nanonets
2. EasePDF
3. pdftoexcel
4. PDFZilla
5. Adobe Acrobat PDF to Excel
Learn more about exporting from PDF to Excel:
https://nanonets.com/blog/pdf-to-excel/
