Is there an existing method that will tokenize/chunk(?) data from a file using CR/LF? The use case is to decompose a file into PDF objects, defined as strings terminated by CR/LF. (If there is an existing framework/project available, I have not found it, just dead ends. :-( )
I have been exploring in #String and #ByteString and this is all I have found that is close to what I need.
"Finds first occurance of #Sting" self findString: ( Character cr asString, Character lf asString). "Breaks at either token value" self findTokens: ( Character cr asString, Character lf asString)
I have tried poking around in #MultiByteFileStream, but keep running into errors.
If there is no existing method, any suggestions on how to write a new one? My naive approach is to scan for CR, then peek for LF, keeping track of my pointers and using them to identify the CR/LF-delimited substrings; or to iterate through the contents using #findString:.
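That scan-and-peek approach maps directly onto standard Squeak stream protocol; #upTo: and #peekFor: both exist on ReadStream. This is only an illustrative sketch of the idea, not the poster's code:

```smalltalk
"Collect CR/LF-terminated substrings from a string.
 #upTo: reads up to (and past) the next CR; #peekFor: then
 consumes the LF only when it immediately follows."
| input stream tokens |
input := 'one', Character cr asString, Character lf asString,
         'two', Character cr asString, Character lf asString.
stream := ReadStream on: input.
tokens := OrderedCollection new.
[stream atEnd] whileFalse: [
    | token |
    token := stream upTo: Character cr.
    stream peekFor: Character lf.
    tokens add: token].
```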
TIA, jrm
----- Image ----- C:\Smalltalk\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit-201608180858-Windows\Squeak5.1-16549-64bit.1.image Squeak5.1 latest update: #16549 Current Change Set: PDFPlayground Image format 68021 (64 bit)
Operating System Details ------------------------ Operating System: Windows 7 Professional (Build 7601 Service Pack 1) Registered Owner: T530 Registered Company: SP major version: 1 SP minor version: 0 Suite mask: 100 Product type: 1
Hi John,
Windows files normally use CR/LF as line termination. Linux files normally use LF. Look at the #subStrings: and friends. You may want to change all CR/LF to LF and then all CR to LF and then split the file at LFs. You could also look into the various stream classes.
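Lou's normalize-then-split idea can be sketched like this (#copyReplaceAll:with: and #subStrings: are standard String protocol; note that #subStrings: drops empty lines, since it treats runs of separators as one):

```smalltalk
"Normalize CR/LF and bare CR to LF, then split at LF."
| cr lf crlf raw normalized parts |
cr := Character cr asString.
lf := Character lf asString.
crlf := cr, lf.
raw := 'one', crlf, 'two', cr, 'three', lf.
normalized := (raw copyReplaceAll: crlf with: lf)
                  copyReplaceAll: cr with: lf.
parts := normalized subStrings: lf.
```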
There are lots of ways to do this and if you are just learning, it doesn't hurt to try a few of them.
Lou
On Tue, 25 Jul 2017 06:00:25 +1200, John-Reed Maffeo jrmaffeo@gmail.com wrote:
Hi JRM,
I think MultiByteFileStream is where you want to work on this. Since you said it is specifically a file with CR/LF line endings, this is the place.
There are tricks to making it work, which aren't clearly documented (unfortunately).
This looks like how the MultiByteFileStream is supposed to work:
1. Open the file.
2. Send #wantsLineEndConversion: true to the file.
3. Send #ascii to the file (to tell it this is a text file, and to determine the CR/LF, CR, or LF encoding).
4. Read data from the file. It should convert CR/LF to just CR, and all things are happy.
Except if you send something like #next: 20, and the last character isn't a #Cr, then it looks like it would be buggy. But, please try this and see if it works. If so, please let me know.
An alternative seems to be that you could just open it without any of those changes, and go through the file line by line (sending #nextLine to the file), and the implementation of #nextLine in PositionableStream should also take care of the Cr/Lf issues.
If you try this route, please let me know how it goes as well.
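The #nextLine alternative above might look like this (a hedged sketch; the file name is hypothetical, and #nextLine in PositionableStream handles CR, LF, or CR/LF terminators):

```smalltalk
"Read a file line by line; each line comes back without its terminator."
| file lines |
file := FileStream readOnlyFileNamed: 'example.pdf'.
lines := OrderedCollection new.
[file atEnd] whileFalse: [lines add: file nextLine].
file close.
```

One caveat: PDF files also contain binary stream data, so line-oriented reading is only safe for the textual object structure.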
Thanks, cbc
On Mon, Jul 24, 2017 at 11:00 AM, John-Reed Maffeo jrmaffeo@gmail.com wrote:
Beginners mailing list Beginners@lists.squeakfoundation.org http://lists.squeakfoundation.org/mailman/listinfo/beginners
Chris, Lou, thanks. After more research on the web, I think I need to rethink my approach to the problem: PDFs are actually designed to be read "backwards", starting at the end. My question is still valid and I am working on a solution. I will post something if it is useful.
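For context on that "backwards" structure: per the PDF file-structure rules, the last lines of a well-formed file are the keyword 'startxref', the byte offset of the cross-reference table, and '%%EOF'. A hedged sketch of recovering that offset from the tail of the file (the file name is hypothetical; #findTokens: and #position: are standard Squeak protocol):

```smalltalk
"Scan the last kilobyte of a PDF for 'startxref' and read the
 byte offset of the cross-reference table that follows it."
| file tailSize tail idx xrefOffset |
file := FileStream readOnlyFileNamed: 'example.pdf'.
tailSize := 1024 min: file size.
file position: file size - tailSize.
tail := file next: tailSize.
file close.
idx := tail findString: 'startxref'.
"The first token after the keyword is the decimal offset."
xrefOffset := ((tail copyFrom: idx + 'startxref' size to: tail size)
                  findTokens: Character cr asString, Character lf asString)
                  first asNumber.
```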
-jrm
On Mon, Jul 24, 2017 at 5:25 PM, Chris Cunningham cunningham.cb@gmail.com wrote:
On 24/07/17 20:00, John-Reed Maffeo wrote:
You know about the work by Christian Haider? http://christianhaider.de/dokuwiki/doku.php?id=pdf:pdf4smalltalk
Stephan
Stephan,
Thank you, I have seen this reference, but the framework seems to be written in VisualWorks. While it looks like what I need, I am not ready to put in the effort to get up to speed in the VisualWorks ecosystem. There does appear to be a way to download it, but since I am focused on my (hobby) project and the learning opportunity it provides, I will continue to develop a framework in Squeak.
There is a page referenced on the link you provided that discusses porting, so perhaps someday I will take a look at it. From what I have learned so far, PDF is a gnarly mess of pointers and offsets which have to be carefully managed - hard fun!
BTW, I owe the list a reply to this thread about my discovery of the answer to my question.
jrm
On Wed, Aug 16, 2017 at 2:22 PM, Stephan Eggermont stephan@stack.nl wrote:
On 04-09-17 19:32, John-Reed Maffeo wrote:
Christian recently ported it to GemStone. He has done some work that would help with porting, and yes, it would be a large project. The file-parsing part is likely to be quite portable, though. A project going in the other direction is Artefact, on Pharo. Adding file parsing to that would be welcomed, I'm sure, and it is probably easy to port.
Stephan