Announcement

Collapse
No announcement yet.

Howto parse .txt file? Need to get in format for database load

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    Howto parse .txt file? Need to get in format for database load

    Hello,

    I have a .txt file with lots of data I have saved over time, what is the best (free) software that will allow me to get it into a .csv file for inserting into a database.

    The 'records' in the text file are all separated by a string of identifiable characters, so I will know when a new one starts. First part of the 'record' will have a bunch of short fields delimited by '/'s, but after the nth '/', it would be a long stretch of description until the next 'record' starts.

    Will python work? How about Windows PowerShell.

    Input from text file, output to csv file? Is the csv format the best of output? Will be using MS-Access for database

    Thanks
    Better to remain silent and be thought a fool, than to speak and remove all doubt.

    #2
    Python will do it (google something like "python parse strings" or "python split strings"). Perl or Ruby would be other good quick and dirty options.

    CSV or tab delimited files will probably be fine for a one time import. If you're planning to do this a bunch, I'd probably format the output as JSON matching the database schema.

    Edited to add: if you use CSV, and any of your parsed strings have commas in them, be sure to enclose them in quotes:

    "this string, it contains a comma"
    Last edited by Katja; 24 Jun 2015, 11:11 PM.

    Comment


      #3
      cat foo.txt | sed 's/\//,/g' > foo.csv

      That is on linux. If it isn't full of private info just message me and I can convert it in 2 seconds. If it is download sed from cygwin

      Originally posted by Joe-MN View Post
      Hello,

      I have a .txt file with lots of data I have saved over time, what is the best (free) software that will allow me to get it into a .csv file for inserting into a database.

      The 'records' in the text file are all separated by a string of identifiable characters, so I will know when a new one starts. First part of the 'record' will have a bunch of short fields delimited by '/'s, but after the nth '/', it would be a long stretch of description until the next 'record' starts.

      Will python work? How about Windows PowerShell.

      Input from text file, output to csv file? Is the csv format the best of output? Will be using MS-Access for database

      Thanks

      Comment


        #4
        Thanks for the offer. Data is not really private, but it is dirty (as in not all formatted the same, NOT XXX!!!). I will try to do it myself first. Can you point me to someplace with the syntax of the Linux command that you are suggesting to use? I have used Ubuntu just a little. Thanks

        Originally posted by t8burst View Post
        cat foo.txt | sed 's/\//,/g' > foo.csv

        That is on linux. If it isn't full of private info just message me and I can convert it in 2 seconds. If it is download sed from cygwin
        Better to remain silent and be thought a fool, than to speak and remove all doubt.

        Comment


          #5
          I use AWK others use Perl. See the AWK book.
          http://zagam.net/

          Comment


            #6
            Originally posted by Joe-MN View Post
            Thanks for the offer. Data is not really private, but it is dirty (as in not all formatted the same, NOT XXX!!!). I will try to do it myself first. Can you point me to someplace with the syntax of the Linux command that you are suggesting to use? I have used Ubuntu just a little. Thanks
            http://unixhelp.ed.ac.uk/CGI/man-cgi?sed

            There are 50 different ways to do what you want. But if all you need to do is replace "/" with "," then sed is easy. If you have a ton of files I would probably use perl but for a couple sed is fine. sed is just a way to pipe regular expressions through a file. If you want to get fancy google "regular expressions" and embrace the cryptic world of regex
            Last edited by t8burst; 26 Jun 2015, 9:56 AM.

            Comment


              #7
              CSV is harmful

              The problem with CSV is that it uses commas, which are normal printable characters. Then it hides them with quotes that are also normal printable characters. Quotes and escapes are harmful and are an attack vector. I have actually exploited the C <string.h> and <stdio.h> '\0' with real programs.

              In Unix you have tab as a field separator FS and line feed as a record separator RS. If you exclude these non printing characters from your data they will always be valid FS or RS. My C program below replaces any FS or RS with a space when it converts CSV to TSV.

              With Unix data files do not need to load data before you can use it as is the case with XML, S-Expressions, CSV, etc.

              If sorted with sort(1) you can query with the look(1) command. Unix is a 4GL in that it operates on sets of data rather than elements. It also has file security to avoid severe data loss. Read the man pages.

              Code:
              LC_COLLATE=C
              export LC_COLLATE
              # Exact
              look -b "$KEY^I" file
              # Partial
              look -b "$KEY" file. 
              # Use join(1) and cut(1)
              look -b "$KEY^I" file1 | join -t'^I' - file2 | cut -d'^I' -f1-3
              # Use comm() for other set operations
              # use sort() to sort and or merge data
              sort -m -t'^I' -k1 file1 file2 >file3
              # This can be done in a lock with umask(1)
              # Read only lock file created open for writing
              # Write new data in lock file
              # Make readable and then move in to place
              umask 0226
              {
              sort -m -u -t'^I' -k1 file1 file2
              chmod --reference=file3 file3.lock
              mv file3.lock file3
              } >file3.lock
              Code:
              #!/usr/bin/tcc -run
              /*BINFMTC: -O -Wuninitialized -Werror -Wfatal-errors -pedantic-errors
              csv2txt.c - A Microsoft CSV reader - Andrew Buckeridge
              1997,Ford,E350,"Super, ""luxurious"" truck"^M
              1997,Ford,E350,"Go get one now^M
              they are going fast"^M
              http://en.wikipedia.org/wiki/Comma-separated_values
              http://tools.ietf.org/html/rfc4180
              http://www.iana.org/assignments/medi...parated-values
              */
              #include <stdlib.h>
              #include <stdio.h>
              
              int main()
              {
                      int c, o, inquot;
              
                      for (inquot=0,o=' ',c=getchar_unlocked();c!=EOF;o=c,c=getchar_unlocked()) {
                                      switch (c) {
                                      case '\t':
                                              c=' ';
                                              break;
                                      case '\"':
                                              inquot=!inquot;
                                              if (o=='\"') { /* This one an escape */
                                                      putchar_unlocked(c);
                                                      c=' '; /* Next one a quote */
                                              }
                                              continue;
                                      case ',':
                                              if (!inquot)
                                                      c='\t';
                                              break;
                                      case '\r':
                                              continue;
                                      case '\n':
                                              if (inquot)
                                                      c=' ';
                                      }
                                      putchar_unlocked(c);
                      }
              
                      exit(0); /* fflush() <stdio.h> */
              
                      return 0; /* Not reached, but main() returns this */
              }
              Last edited by zagam; 30 Jun 2015, 5:45 AM. Reason: data loss ;(
              http://zagam.net/

              Comment


                #8
                You crack me up. Do even read the original posts when you respond? The guy had a text file where the fields were delineated by "/" and he wanted to change it to "," and then presumably load it into excel (OP: come to think of it, if you are using Excel, go to the "data" tab, click on "import text file" then when following the wizard make "/" the column separator, you don't even need to preprocess the file).

                How in any way is that "dangerous"? In fact there is nothing dangerous about csv files anyway. Any programmer with half a brain can write a CSV parser that can't be used to do any sort of buffer injection or virus loading.

                Originally posted by zagam View Post
                The problem with CSV is that it uses commas, which are normal printable characters. Then it hides them with quotes that are also normal printable characters. Quotes and escapes are harmful and are an attack vector. I have actually exploited the C <string.h> and <stdio.h> '\0' with real programs.

                In Unix you have tab as a field separator FS and line feed as a record separator RS. If you exclude these non printing characters from your data they will always be valid FS or RS. My C program below replaces any FS or RS with a space when it converts CSV to TSV.

                With Unix data files do not need to load data before you can use it as is the case with XML, S-Expressions, CSV, etc.

                If sorted with sort(1) you can query with the look(1) command. Unix is a 4GL in that it operates on sets of data rather than elements. It also has file security to avoid severe data loss. Read the man pages.

                Code:
                LC_COLLATE=C
                export LC_COLLATE
                # Exact
                look -b "$KEY^I" file
                # Partial
                look -b "$KEY" file. 
                # Use join(1) and cut(1)
                look -b "$KEY^I" file1 | join -t'^I' - file2 | cut -d'^I' -f1-3
                # Use comm() for other set operations
                # use sort() to sort and or merge data
                sort -m -t'^I' -k1 file1 file2 >file3
                # This can be done in a lock with umask(1)
                # Read only lock file created open for writing
                # Write new data in lock file
                # Make readable and then move in to place
                umask 0226
                {
                sort -m -u -t'^I' -k1 file1 file2
                chmod --reference=file3 file3.lock
                mv file3.lock file3
                } >file3.lock
                Code:
                #!/usr/bin/tcc -run
                /*BINFMTC: -O -Wuninitialized -Werror -Wfatal-errors -pedantic-errors
                csv2txt.c - A Microsoft CSV reader - Andrew Buckeridge
                1997,Ford,E350,"Super, ""luxurious"" truck"^M
                1997,Ford,E350,"Go get one now^M
                they are going fast"^M
                http://en.wikipedia.org/wiki/Comma-separated_values
                http://tools.ietf.org/html/rfc4180
                http://www.iana.org/assignments/medi...parated-values
                */
                #include <stdlib.h>
                #include <stdio.h>
                
                int main()
                {
                        int c, o, inquot;
                
                        for (inquot=0,o=' ',c=getchar_unlocked();c!=EOF;o=c,c=getchar_unlocked()) {
                                        switch (c) {
                                        case '\t':
                                                c=' ';
                                                break;
                                        case '\"':
                                                inquot=!inquot;
                                                if (o=='\"') { /* This one an escape */
                                                        putchar_unlocked(c);
                                                        c=' '; /* Next one a quote */
                                                }
                                                continue;
                                        case ',':
                                                if (!inquot)
                                                        c='\t';
                                                break;
                                        case '\r':
                                                continue;
                                        case '\n':
                                                if (inquot)
                                                        c=' ';
                                        }
                                        putchar_unlocked(c);
                        }
                
                        exit(0); /* fflush() <stdio.h> */
                
                        return 0; /* Not reached, but main() returns this */
                }

                Comment

                Working...
                X