I'm trying to manage a lot of PDF documents with a FileMaker database; what is the best solution?

Get information from PDF

  • Question

  • Howdy,

    I'm wondering how I can get information from a PDF file.

    For example:

    I have the name of a person and the emails

    steve

    steve@example.com; stevosky@example.com

    How can I get the emails "steve@example.com; stevosky@example.com" and split them, so that email1 = steve@example.com and email2 = stevosky@example.com?

    Please help me, it's important.

Answers

  • Hello,

    To start this off, I have been doing extractions from PDF documents for a long time using a third-party library, as it is much better than the free libraries out there, but I'm not going into it because of its over-$1000 price tag.

    I would suggest looking at the iTextSharp library, with code such as that below. It is best to install this library from NuGet inside Visual Studio: in Solution Explorer, right-click the solution name and select Manage NuGet Packages for Solution.

    The original code below (see the second code block for VB.NET) came from Stack Overflow.

    using System.IO;
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            // iTextSharp page numbers start at 1
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

    I took this and added it to a class library project, as per below.

    Imports System.Text
    Imports System.IO
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    Public Class Extracter
        Public Property FileName As String

        Public Sub New()
        End Sub

        Public Function ReadPdfFile() As String
            Dim text As New StringBuilder()

            If File.Exists(FileName) Then
                Dim pdfReader As New PdfReader(FileName)

                ' iTextSharp page numbers start at 1
                For page As Integer = 1 To pdfReader.NumberOfPages

                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
                    Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)

                    currentText = Encoding.UTF8.GetString(
                        ASCIIEncoding.Convert(
                            Encoding.Default,
                            Encoding.UTF8,
                            Encoding.Default.GetBytes(currentText))
                    )

                    ' 4/21/2015 Karen Payne added the following
                    currentText = currentText.Replace(vbLf, Environment.NewLine)

                    text.Append(currentText)
                    text.AppendLine(Environment.NewLine)

                Next page
                pdfReader.Close()
            Else
                '
                ' You decide on how to handle file not found
                '
            End If

            Return text.ToString()

        End Function
    End Class

    I created a console project, added a reference to the class library as per above, and then just extracted the text and saved it to a text file to ensure it works.

    Imports iTextSharpHelper

    Module Module1
        Sub Main()
            ' File1.pdf is expected next to the executable
            Dim Extracter As New Extracter With {
                .FileName = IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "File1.pdf")
            }
            IO.File.WriteAllText(
                IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "file1.txt"),
                Extracter.ReadPdfFile)
        End Sub
    End Module

    The method ReadPdfFile returns a string which you can then use to find and extract text; see the String class for methods to find things in strings.

    Comment: the above 'AS IS' treats the extracted text as a string, while the library I use treats each page as a memory stream with a single method to convert the stream to a List(Of String).
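
    As a rough illustration of working with the returned string, here is a minimal sketch that turns extracted text into a List(Of String) of lines and picks out the line holding the email addresses. The names ToLines and FindLineContaining are made up for this example and are not part of iTextSharp; the sample text stands in for whatever ReadPdfFile returns, and real PDFs may break lines differently.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports Microsoft.VisualBasic

Module ExtractedTextDemo
    ' Hypothetical helper: splits extracted text into trimmed, non-empty lines.
    Function ToLines(text As String) As List(Of String)
        Return text.Split({vbCrLf, vbLf, vbCr}, StringSplitOptions.RemoveEmptyEntries).
                    Select(Function(line) line.Trim()).
                    Where(Function(line) line.Length > 0).
                    ToList()
    End Function

    ' Hypothetical helper: returns the first line containing the marker,
    ' or Nothing when no line matches.
    Function FindLineContaining(lines As List(Of String), marker As String) As String
        Return lines.FirstOrDefault(
            Function(line) line.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
    End Function

    Sub Main()
        ' Stand-in for the string returned by ReadPdfFile (an assumption for this demo)
        Dim extracted As String =
            "steve" & vbCrLf & vbCrLf & "steve@example.com; stevosky@example.com" & vbCrLf

        Dim lines As List(Of String) = ToLines(extracted)
        Console.WriteLine(lines.Count)
        Console.WriteLine(FindLineContaining(lines, "@"))
    End Sub
End Module
```

    With the sample text above, the list holds two lines (the name and the address line), and the second call returns the address line, which can then be split further.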

    Hope this helps.


    Please remember to mark the replies as answers if they help and unmark them if they provide no help; this will help others who are looking for solutions to the same or a similar problem. Contact via my webpage under my profile, but do not reply to forum questions.

    • Proposed as answer by Tuesday, April 21, 2015 5:33 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 4:01 PM
  • I have never tried reading from PDF files, but there are libraries available to let you work with them. The one I hear of most often is iTextSharp. If you need help using the library that you pick, you will need to find the support website for that product.

    If you need help splitting a string that looks like this "steve@example.com; stevosky@example.com", you can use the String.Split method like this.

    Dim combined As String = "steve@example.com; stevosky@example.com"
    Dim emails() As String = combined.Split("; ".ToCharArray, StringSplitOptions.RemoveEmptyEntries)

    The result will be a string array with two elements:

    • emails(0) will contain "steve@example.com"
    • emails(1) will contain "stevosky@example.com"
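
    To land the two addresses in separate variables as the question asks, something like the sketch below might do. SplitEmails is a name invented for this example; it splits on ";" only and trims each piece, and the length checks guard against a line that holds fewer than two addresses.

```vbnet
Imports System
Imports System.Linq

Module SplitEmailsDemo
    ' Hypothetical helper: splits on ";" and trims whitespace around each address.
    Function SplitEmails(combined As String) As String()
        Return combined.Split({";"c}, StringSplitOptions.RemoveEmptyEntries).
                        Select(Function(part) part.Trim()).
                        Where(Function(part) part.Length > 0).
                        ToArray()
    End Function

    Sub Main()
        Dim emails As String() = SplitEmails("steve@example.com; stevosky@example.com")

        ' Fall back to an empty string when an address is missing
        Dim email1 As String = If(emails.Length > 0, emails(0), String.Empty)
        Dim email2 As String = If(emails.Length > 1, emails(1), String.Empty)

        Console.WriteLine("email1 = " & email1)
        Console.WriteLine("email2 = " & email2)
    End Sub
End Module
```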

    • Edited by Blackwood Tuesday, April 21, 2015 1:28 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 4:00 PM
  • In this example I used the small class that Kevin provided, with a small change to pass the filename as a parameter instead of using a separate property. I also used RegularExpressions to find all the email addresses in the text that is read from the PDF file, and listed them in a RichTextBox. You can experiment with it to make it work the way you want.

    Form1 Code - Form1 has 1 Button and 1 RichTextBox added to it.

    Imports System.Text.RegularExpressions

    Public Class Form1
        Private txtExtractor As New Extracter

        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            Using ofd As New OpenFileDialog
                ofd.Filter = "Pdf files|*.pdf"
                If ofd.ShowDialog = DialogResult.OK Then
                    'read the text from the pdf file
                    Dim pdftxt As String = txtExtractor.ReadPdfFile(ofd.FileName)

                    'find all email addresses
                    Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                    Dim mtchs As MatchCollection = rgx.Matches(pdftxt)

                    'add each email address that was found to the RichTextBox
                    For Each m As Match In mtchs
                        RichTextBox1.AppendText(m.Value & vbNewLine)
                    Next
                End If
            End Using
        End Sub
    End Class

    The code I used for the Extracter class Kevin posted.

    Imports System.Text
    Imports System.IO
    Imports iTextSharp.text.pdf
    Imports iTextSharp.text.pdf.parser

    Public Class Extracter
        Public Sub New()
        End Sub

        ' Same as the class above, except the file name is passed as a parameter
        Public Function ReadPdfFile(ByVal filename As String) As String
            Dim text As New StringBuilder()

            If File.Exists(filename) Then
                Dim pdfReader As New PdfReader(filename)

                For page As Integer = 1 To pdfReader.NumberOfPages
                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
                    Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)

                    currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)))

                    currentText = currentText.Replace(vbLf, Environment.NewLine)

                    text.Append(currentText)
                    text.AppendLine(Environment.NewLine)
                Next
                pdfReader.Close()
            End If

            Return text.ToString()
        End Function
    End Class

    After opening a small PDF file I made that has 4 email addresses in it, this is the result.
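
    If you want to experiment with the matching step without a PDF or a form, a console sketch along these lines may help. It uses a shorter, simplified pattern of my own rather than the one in the form code (whose final group, [a-zA-Z]{2,4}, only matches endings of 2 to 4 letters), it is not a full email validator, and FindDistinctEmails is a name invented for this example. It also drops repeated addresses, ignoring case.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module EmailRegexDemo
    ' Simplified illustrative pattern (an assumption, not the pattern from the post)
    Private ReadOnly EmailPattern As New Regex("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    ' Hypothetical helper: returns each address once, in order of first appearance.
    Function FindDistinctEmails(text As String) As List(Of String)
        Dim seen As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        Dim result As New List(Of String)

        For Each m As Match In EmailPattern.Matches(text)
            ' HashSet.Add returns False when the address was already seen
            If seen.Add(m.Value) Then result.Add(m.Value)
        Next

        Return result
    End Function

    Sub Main()
        ' Stand-in for text read from a PDF
        Dim sample As String =
            "steve" & vbCrLf & "steve@example.com; stevosky@example.com; Steve@Example.com"

        For Each address As String In FindDistinctEmails(sample)
            Console.WriteLine(address)
        Next
    End Sub
End Module
```

    For the sample text this lists steve@example.com and stevosky@example.com once each; the third entry differs only in case, so it is skipped.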


    If you say it can`t be done then i`ll try it

    • Proposed as answer by Kareninstructor MVP Tuesday, April 21, 2015 7:27 PM
    • Unproposed as answer by Ko0kiE Wednesday, April 22, 2015 2:47 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 4:00 PM
  • Do you have all of the PDF file paths stored in a List(Of String) or an Array already? If you do, then you can use a For Each loop to iterate through the file paths and get the links from each file.

     If you only want to be able to select more than one file using the OpenFileDialog, then you can set its Multiselect property to True and then iterate through each file.

    Using ofd As New OpenFileDialog
        ofd.Filter = "Pdf files|*.pdf"
        ofd.Multiselect = True 'this allows you to select more than one file at a time

        If ofd.ShowDialog = DialogResult.OK Then

            For Each fn As String In ofd.FileNames 'iterate through each filename that was selected

                'read the text from the pdf file
                Dim pdftxt As String = txtExtractor.ReadPdfFile(fn)

                'find all email addresses
                Dim rgx As New Regex("([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                Dim mtchs As MatchCollection = rgx.Matches(pdftxt)

                'add each email address that was found to the RichTextBox
                'you can add them to a List(Of String) instead, if you want
                For Each m As Match In mtchs
                    RichTextBox1.AppendText(m.Value & vbNewLine)
                Next

            Next

        End If
    End Using
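
    For the other case, where the paths already sit in a List(Of String) or an array, a sketch of the For Each loop could look like this. CollectEmails and the readText parameter are names invented for this example, and the pattern is a simplified stand-in; the fake file table only exists so the sketch runs without iTextSharp or real PDFs.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module BatchScanDemo
    ' Hypothetical helper: maps each path to the addresses found in its text.
    ' readText is whatever turns a path into text (an assumption of this sketch).
    Function CollectEmails(paths As IEnumerable(Of String),
                           readText As Func(Of String, String)) As Dictionary(Of String, List(Of String))
        ' Simplified illustrative pattern, not a full email validator
        Dim rgx As New Regex("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
        Dim found As New Dictionary(Of String, List(Of String))

        For Each filePath As String In paths
            Dim emails As New List(Of String)
            For Each m As Match In rgx.Matches(readText(filePath))
                emails.Add(m.Value)
            Next
            found(filePath) = emails
        Next

        Return found
    End Function

    Sub Main()
        ' Stand-in for real files: path -> text that would come out of the PDF
        Dim fakeFiles As New Dictionary(Of String, String) From {
            {"a.pdf", "steve@example.com; stevosky@example.com"},
            {"b.pdf", "no addresses here"}
        }

        Dim results = CollectEmails(fakeFiles.Keys, Function(p) fakeFiles(p))

        For Each entry In results
            Console.WriteLine(entry.Key & ": " & String.Join(", ", entry.Value))
        Next
    End Sub
End Module
```

    In the real program you would presumably pass your own list of paths and AddressOf txtExtractor.ReadPdfFile (the version that takes the filename as a parameter) in place of the lambda.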


    • Edited by IronRazerz Wednesday, April 22, 2015 3:23 PM
    • Marked as answer by Ko0kiE Wednesday, April 22, 2015 3:33 PM

Source: https://social.msdn.microsoft.com/Forums/windows/en-US/d20f8b5f-ab8b-41e7-adb4-56a79c8a550f/get-data-from-pdf