[squeak-dev] The Inbox: Files-cmm.125.mcz

Levente Uzonyi leves at elte.hu
Thu Jul 4 08:09:46 UTC 2013


On Wed, 3 Jul 2013, Chris Muller wrote:

> Good discussion thanks Levente.  You know, after more thinking I had
> trouble understanding what is the characteristic of the FilePlugin
> primitives that leads to better performance in a BFS?

With the current DFS implementation (you use FileDirectory >> #entries 
from FileDirectory #directoryTreeDo:entries: which creates an Array with 
a DirectoryEntry object for all files in that directory) you keep all 
DirectoryEntry instances in the memory up to the root during the 
iteration.
With BFS you don't have to do that, because once you start processing a 
directory, you won't start another before you finish the current one.
You could avoid storing the entries in memory with DFS too - by using 
FileDirectory >> #entriesDo: from Files-ul.125, but that will potentially 
expose the bottleneck present in the FilePlugin. Here's a 
small demonstration:

path := '/usr/bin'.
"Single process using the FilePlugin"
[ (FileDirectory on: path) directoryTreeDo: [ :each | Processor yield ] ] timeToRun.
"1208"

"2 processes using the FilePlugin"
[
 	(FileDirectory on: path) directoryTreeDo: [ :each |
 		Processor yield ] ] fork.
[ (FileDirectory on: path) directoryTreeDo: [ :each |
 		Processor yield ] ] timeToRun.
"56716"

What's happening?
Opening a directory and getting the nth element of it is not cheap, but 
that's what #primLookupEntryIn:index: offers. To speed up things, 
FilePlugin stores the last opened directory and the index of the last 
accessed file in it. If you ask for the next file from the same directory, 
then it'll reuse the stored directory, and everything will seem fast. But 
if you open another directory, or ask for any file other than the next one 
from the same directory, then it'll be much slower.

For example asking for the files from a directory, but in reversed order 
will take O(n*n) time, where n is the number of files in the directory.

>
> The only use-case I can think of would be where the doBlock of
> directoryTreeDo: could be short-circuited by a non-local return, such

I don't see how this is related to BFS vs DFS, or my suggestion to change 
the argument of directoryTreeDo:.

> as a detect operation.  But in a stats operation, it seems like the
> same requests to FilePlugin are made in each case.  That is, every
> subdirectory gets read just once via the #entries primitive in either
> case BFS or DFS..

Indeed, the difference is in the memory usage.

>
> But if the whole sub-tree is going to be accessed it seems like a
> wash, and indeed I'm getting pretty close times between my and your
> version of directoryTreeDo::
>
> |dir| dir:= FileDirectory default containingDirectory containingDirectory.
> [dir directoryTreeDo: [ : e |  ]] timeToRun
>
>  (BFS) "444"
>  (DFS) "428"

Did you running the benchmark on /usr/ from 
http://lists.squeakfoundation.org/pipermail/squeak-dev/2013-June/171827.html 
?


Levente

>
> I still believe a directoryTreeDetect: operation has potential for
> less exercise of FilePlugin under BFS, but did I miss something for a
> full tree-traversal?
>
> Thanks.
>
>
>
> On Wed, Jul 3, 2013 at 12:48 AM, Levente Uzonyi <leves at elte.hu> wrote:
>> On Sun, 30 Jun 2013, Chris Muller wrote:
>>
>>> Ah, thank you!  Now I understand Levente's comments.
>>>
>>> Levente, that is an excellent point about directoryTreeDo: traversing
>>> depth-first.  In fact that makes me want to change it to be
>>> breadth-first..  I'm trying to think about all the use-cases it serves
>>> and whether that would make a difference; since we have the path,
>>> maybe not..
>>
>>
>> The only techinal hurdle in changing the algorithm is the argument of
>> argument the block. Currently it is an OrderedCollection of DirectoryEntry
>> objects from the root to the current DirectoryEntry object. With DFS this is
>> naturally available, but with BFS it's not.
>>
>> So the question is, what should we do?
>>
>> 1) keep the existing OrderedCollection argument
>>
>> I uploaded Files-ul.125 to the Inbox, which changes the implementation of
>> #directoryTreeDo: this way. As you can see, it's not an easy to understand
>> algorithm. :)
>>
>> 2) use some kind of linked list-like data structure instead.
>>
>> Here's an example implementation for this:
>>
>>         | queue argument |
>>         argument := { self directoryEntry. nil }.
>>         oneArgBlock value: argument.
>>         queue := OrderedCollection with: argument copy.
>>         [ queue isEmpty ] whileFalse: [
>>                 | parent |
>>                 parent := queue removeFirst.
>>                 argument at: 2 put: parent.
>>                 parent first asFileDirectory entriesDo: [ :entry |
>>                         argument at: 1 put: entry.
>>                         oneArgBlock value: argument.
>>                         entry isDirectory ifTrue: [ queue addLast: argument
>> copy ] ] ]
>>
>> 3) only pass the current DirectoryEntry to the block and let the user decide
>> if he needs the whole path (though it costs a lot to recreate it)
>>
>> Implementation:
>>
>>         | queue |
>>         queue := OrderedCollection with: self directoryEntry.
>>         oneArgBlock value: queue first.
>>         [ queue isEmpty ] whileFalse: [
>>                 queue removeFirst asFileDirectory entriesDo: [ :entry |
>>                         oneArgBlock value: entry.
>>                         entry isDirectory ifTrue: [ queue addLast: entry ] ]
>> ]
>>
>> Both examples at 2) and 3) require the #entriesDo: method, which is
>> available in Files-ul.125 from the Inbox.
>>
>> If you profile any of this code on a larger directory tree, you'll see that
>> there's plenty of room for improvement in FileDirectory.
>>
>>
>> Levente
>>
>>
>>>
>>> On Sun, Jun 30, 2013 at 1:29 PM, Frank Shearar <frank.shearar at gmail.com>
>>> wrote:
>>>>
>>>> On 30 June 2013 18:19, Chris Muller <asqueaker at gmail.com> wrote:
>>>>>
>>>>> Hi Levente, I do not know what you mean by DFS and BFS.
>>>>
>>>>
>>>> Depth-first versus breadth-first search.
>>>>
>>>> frank
>>>>
>>>>>  My claim that
>>>>> statsForDirectoryTree: "does one useless thing, and does it
>>>>> inefficiently" is commentary about the _design_, not the
>>>>> implementation, because it forces me to enumerate an entire directory
>>>>> even if I don't want to.  For example, what if I want to exclude a
>>>>> ".temp" directory's stats from its containing directory's stats?  To
>>>>> do that, I'd have to run #statsForDirectoryTree: twice.  That's not
>>>>> going to be faster..
>>>>>
>>>>> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <leves at elte.hu> wrote:
>>>>>>
>>>>>> On Fri, 28 Jun 2013, commits at source.squeak.org wrote:
>>>>>>
>>>>>>> A new version of Files was added to project The Inbox:
>>>>>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>>>>>
>>>>>>> ==================== Summary ====================
>>>>>>>
>>>>>>> Name: Files-cmm.125
>>>>>>> Author: cmm
>>>>>>> Time: 27 June 2013, 7:24:33.37 pm
>>>>>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>>>>>> Ancestors: Files-fbs.124
>>>>>>>
>>>>>>> FileDirectory>>statsForDirectoryTree: is a prime example of
>>>>>>> FileDirectory
>>>>>>> living up to its poor reputation.  It does one useless thing, and does
>>>>>>> it
>>>>>>> inefficiently.  It has no senders, deservedly.
>>>>>>>         Tree-walking operations like gathering stats can be written in
>>>>>>> apps more generally using directoryTreeDo:.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
>>>>>> used
>>>>>> by #directoryTreeDo:entries:), which is somewhat better than DFS,
>>>>>> because of
>>>>>> the way FilePlugin works: you have to iterate over the contents of the
>>>>>> whole
>>>>>> directory if you want to it to be efficient.
>>>>>> If you use DFS, then the contents of all parent directories up to the
>>>>>> root
>>>>>> will have to be stored in memory during the iteration, while if you use
>>>>>> BFS,
>>>>>> then only the sibling directories (without their contents) or the
>>>>>> siblings's
>>>>>> parents will be in memory.
>>>>>>
>>>>>> I made two "new" implementations of #statsForDirectoryTree:. This one
>>>>>> optimizes memory usage for BFS:
>>>>>>
>>>>>> statsForDirectoryTree2: rootedPathName
>>>>>>
>>>>>>         | dirs files bytes todo |
>>>>>>
>>>>>>         dirs := files := bytes := 0.
>>>>>>         todo := OrderedCollection with: rootedPathName.
>>>>>>         [ todo isEmpty ] whileFalse: [
>>>>>>                 | p |
>>>>>>                 p := todo removeFirst.
>>>>>>                 self directoryContentsFor: p do: [ :entry |
>>>>>>                         entry isDirectory
>>>>>>                                 ifTrue: [
>>>>>>
>>>>>>                                         todo addLast: p , self
>>>>>> pathNameDelimiter asString , entry name.
>>>>>>                                         dirs := dirs + 1]
>>>>>>                                 ifFalse: [
>>>>>>
>>>>>>                                         files := files + 1.
>>>>>>                                         bytes := bytes + entry fileSize
>>>>>> ] ]
>>>>>> ].
>>>>>>         ^{ dirs. files. bytes }
>>>>>>
>>>>>> This one uses #directoryTreeDo::
>>>>>>
>>>>>> statsForDirectoryTree3: rootedPathName
>>>>>>
>>>>>>         | dirs files bytes |
>>>>>>
>>>>>>         dirs := files := bytes := 0.
>>>>>>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>>>>>                 | entry |
>>>>>>                 entry := path last.
>>>>>>                 entry isDirectory
>>>>>>                         ifTrue: [ dirs := dirs + 1 ]
>>>>>>                         ifFalse: [
>>>>>>
>>>>>>                                 files := files + 1.
>>>>>>                                 bytes := bytes + entry fileSize ] ].
>>>>>>         ^{ dirs. files. bytes }
>>>>>>
>>>>>> Let's see the numbers:
>>>>>>
>>>>>> "Make sure the disk won't be accessed"
>>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>>> "Run the benchmark"
>>>>>> #(statsForDirectoryTree: statsForDirectoryTree2:
>>>>>> statsForDirectoryTree3:)
>>>>>> collect: [ :each |
>>>>>>         each -> ((1 to: 5) collect: [ :run |
>>>>>>                 Smalltalk garbageCollect.
>>>>>>                 [ FileDirectory default perform: each with: '/usr/' ]
>>>>>> timeToRun ]) ].
>>>>>>
>>>>>> {
>>>>>>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>>>>>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>>>>>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>>>>>
>>>>>> It's a bit noisy (i was too lazy to exit other programs), but the one
>>>>>> using
>>>>>> #directoryTreeDo: seems to be clearly slower than the existing
>>>>>> algorithm
>>>>>> using BFS.
>>>>>>
>>>>>>
>>>>>> Levente
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> =============== Diff against Files-fbs.124 ===============
>>>>>>>
>>>>>>> Item was changed:
>>>>>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>>> 'private') -----
>>>>>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>>> 'enumeration') -----
>>>>>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>>>>>         "Value oneArgBlock with the path (an OrderedCollection of
>>>>>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry
>>>>>>> itself."
>>>>>>>         self entries do:
>>>>>>>                 [ : each |
>>>>>>>                 entriesCollection add: each.
>>>>>>>                 oneArgBlock value: entriesCollection.
>>>>>>>                 each isDirectory ifTrue:
>>>>>>>                         [ | subdir |
>>>>>>>                         subdir := each asFileDirectory.
>>>>>>>                         subdir
>>>>>>>                                 directoryTreeDo: oneArgBlock
>>>>>>>                                 entries: entriesCollection ].
>>>>>>>                 entriesCollection removeLast ]!
>>>>>>>
>>>>>>> Item was removed:
>>>>>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>>>>>> 'enumeration') -----
>>>>>>> - statsForDirectoryTree: rootedPathName
>>>>>>> -       "Return the size statistics for the entire directory tree
>>>>>>> starting
>>>>>>> at the given root. The result is a three element array of the form:
>>>>>>> (<number
>>>>>>> of folders><number of files><total bytes in all files>). This method
>>>>>>> also
>>>>>>> serves as an example of how recursively enumerate a directory tree."
>>>>>>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>>>>>> -
>>>>>>> -       | dirs files bytes todo entries p |
>>>>>>> -       dirs := files := bytes := 0.
>>>>>>> -       todo := OrderedCollection with: rootedPathName.
>>>>>>> -       [todo isEmpty] whileFalse: [
>>>>>>> -               p := todo removeFirst.
>>>>>>> -               entries := self directoryContentsFor: p.
>>>>>>> -               entries do: [:entry |
>>>>>>> -                       entry isDirectory
>>>>>>> -                               ifTrue: [
>>>>>>> -                                       todo addLast: p , self
>>>>>>> pathNameDelimiter asString , entry name.
>>>>>>> -                                       dirs := dirs + 1]
>>>>>>> -                               ifFalse: [
>>>>>>> -                                       files := files + 1.
>>>>>>> -                                       bytes := bytes + entry
>>>>>>> fileSize]]].
>>>>>>> -       ^ Array with: dirs with: files with: bytes
>>>>>>> - !
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>


More information about the Squeak-dev mailing list