[squeak-dev] The Inbox: Files-cmm.125.mcz

Chris Muller asqueaker at gmail.com
Wed Jul 3 18:13:15 UTC 2013


Good discussion thanks Levente.  You know, after more thinking I had
trouble understanding what is the characteristic of the FilePlugin
primitives that leads to better performance in a BFS?

The only use-case I can think of would be where the doBlock of
directoryTreeDo: could be short-circuited by a non-local return, such
as a detect operation.  But in a stats operation, it seems like the
same requests to FilePlugin are made in each case.  That is, every
subdirectory gets read just once via the #entries primitive in either
case BFS or DFS..

But if the whole sub-tree is going to be accessed it seems like a
wash, and indeed I'm getting pretty close times between my and your
version of directoryTreeDo::

|dir| dir:= FileDirectory default containingDirectory containingDirectory.
[dir directoryTreeDo: [ : e |  ]] timeToRun

  (BFS) "444"
  (DFS) "428"

I still believe a directoryTreeDetect: operation has potential for
less exercise of FilePlugin under BFS, but did I miss something for a
full tree-traversal?

Thanks.



On Wed, Jul 3, 2013 at 12:48 AM, Levente Uzonyi <leves at elte.hu> wrote:
> On Sun, 30 Jun 2013, Chris Muller wrote:
>
>> Ah, thank you!  Now I understand Levente's comments.
>>
>> Levente, that is an excellent point about directoryTreeDo: traversing
>> depth-first.  In fact that makes me want to change it to be
>> breadth-first..  I'm trying to think about all the use-cases it serves
>> and whether that would make a difference; since we have the path,
>> maybe not..
>
>
> The only techinal hurdle in changing the algorithm is the argument of
> argument the block. Currently it is an OrderedCollection of DirectoryEntry
> objects from the root to the current DirectoryEntry object. With DFS this is
> naturally available, but with BFS it's not.
>
> So the question is, what should we do?
>
> 1) keep the existing OrderedCollection argument
>
> I uploaded Files-ul.125 to the Inbox, which changes the implementation of
> #directoryTreeDo: this way. As you can see, it's not an easy to understand
> algorithm. :)
>
> 2) use some kind of linked list-like data structure instead.
>
> Here's an example implementation for this:
>
>         | queue argument |
>         argument := { self directoryEntry. nil }.
>         oneArgBlock value: argument.
>         queue := OrderedCollection with: argument copy.
>         [ queue isEmpty ] whileFalse: [
>                 | parent |
>                 parent := queue removeFirst.
>                 argument at: 2 put: parent.
>                 parent first asFileDirectory entriesDo: [ :entry |
>                         argument at: 1 put: entry.
>                         oneArgBlock value: argument.
>                         entry isDirectory ifTrue: [ queue addLast: argument
> copy ] ] ]
>
> 3) only pass the current DirectoryEntry to the block and let the user decide
> if he needs the whole path (though it costs a lot to recreate it)
>
> Implementation:
>
>         | queue |
>         queue := OrderedCollection with: self directoryEntry.
>         oneArgBlock value: queue first.
>         [ queue isEmpty ] whileFalse: [
>                 queue removeFirst asFileDirectory entriesDo: [ :entry |
>                         oneArgBlock value: entry.
>                         entry isDirectory ifTrue: [ queue addLast: entry ] ]
> ]
>
> Both examples at 2) and 3) require the #entriesDo: method, which is
> available in Files-ul.125 from the Inbox.
>
> If you profile any of this code on a larger directory tree, you'll see that
> there's plenty of room for improvement in FileDirectory.
>
>
> Levente
>
>
>>
>> On Sun, Jun 30, 2013 at 1:29 PM, Frank Shearar <frank.shearar at gmail.com>
>> wrote:
>>>
>>> On 30 June 2013 18:19, Chris Muller <asqueaker at gmail.com> wrote:
>>>>
>>>> Hi Levente, I do not know what you mean by DFS and BFS.
>>>
>>>
>>> Depth-first versus breadth-first search.
>>>
>>> frank
>>>
>>>>  My claim that
>>>> statsForDirectoryTree: "does one useless thing, and does it
>>>> inefficiently" is commentary about the _design_, not the
>>>> implementation, because it forces me to enumerate an entire directory
>>>> even if I don't want to.  For example, what if I want to exclude a
>>>> ".temp" directory's stats from its containing directory's stats?  To
>>>> do that, I'd have to run #statsForDirectoryTree: twice.  That's not
>>>> going to be faster..
>>>>
>>>> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <leves at elte.hu> wrote:
>>>>>
>>>>> On Fri, 28 Jun 2013, commits at source.squeak.org wrote:
>>>>>
>>>>>> A new version of Files was added to project The Inbox:
>>>>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>>>>
>>>>>> ==================== Summary ====================
>>>>>>
>>>>>> Name: Files-cmm.125
>>>>>> Author: cmm
>>>>>> Time: 27 June 2013, 7:24:33.37 pm
>>>>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>>>>> Ancestors: Files-fbs.124
>>>>>>
>>>>>> FileDirectory>>statsForDirectoryTree: is a prime example of
>>>>>> FileDirectory
>>>>>> living up to its poor reputation.  It does one useless thing, and does
>>>>>> it
>>>>>> inefficiently.  It has no senders, deservedly.
>>>>>>         Tree-walking operations like gathering stats can be written in
>>>>>> apps more generally using directoryTreeDo:.
>>>>>
>>>>>
>>>>>
>>>>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
>>>>> used
>>>>> by #directoryTreeDo:entries:), which is somewhat better than DFS,
>>>>> because of
>>>>> the way FilePlugin works: you have to iterate over the contents of the
>>>>> whole
>>>>> directory if you want to it to be efficient.
>>>>> If you use DFS, then the contents of all parent directories up to the
>>>>> root
>>>>> will have to be stored in memory during the iteration, while if you use
>>>>> BFS,
>>>>> then only the sibling directories (without their contents) or the
>>>>> siblings's
>>>>> parents will be in memory.
>>>>>
>>>>> I made two "new" implementations of #statsForDirectoryTree:. This one
>>>>> optimizes memory usage for BFS:
>>>>>
>>>>> statsForDirectoryTree2: rootedPathName
>>>>>
>>>>>         | dirs files bytes todo |
>>>>>
>>>>>         dirs := files := bytes := 0.
>>>>>         todo := OrderedCollection with: rootedPathName.
>>>>>         [ todo isEmpty ] whileFalse: [
>>>>>                 | p |
>>>>>                 p := todo removeFirst.
>>>>>                 self directoryContentsFor: p do: [ :entry |
>>>>>                         entry isDirectory
>>>>>                                 ifTrue: [
>>>>>
>>>>>                                         todo addLast: p , self
>>>>> pathNameDelimiter asString , entry name.
>>>>>                                         dirs := dirs + 1]
>>>>>                                 ifFalse: [
>>>>>
>>>>>                                         files := files + 1.
>>>>>                                         bytes := bytes + entry fileSize
>>>>> ] ]
>>>>> ].
>>>>>         ^{ dirs. files. bytes }
>>>>>
>>>>> This one uses #directoryTreeDo::
>>>>>
>>>>> statsForDirectoryTree3: rootedPathName
>>>>>
>>>>>         | dirs files bytes |
>>>>>
>>>>>         dirs := files := bytes := 0.
>>>>>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>>>>                 | entry |
>>>>>                 entry := path last.
>>>>>                 entry isDirectory
>>>>>                         ifTrue: [ dirs := dirs + 1 ]
>>>>>                         ifFalse: [
>>>>>
>>>>>                                 files := files + 1.
>>>>>                                 bytes := bytes + entry fileSize ] ].
>>>>>         ^{ dirs. files. bytes }
>>>>>
>>>>> Let's see the numbers:
>>>>>
>>>>> "Make sure the disk won't be accessed"
>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>> "Run the benchmark"
>>>>> #(statsForDirectoryTree: statsForDirectoryTree2:
>>>>> statsForDirectoryTree3:)
>>>>> collect: [ :each |
>>>>>         each -> ((1 to: 5) collect: [ :run |
>>>>>                 Smalltalk garbageCollect.
>>>>>                 [ FileDirectory default perform: each with: '/usr/' ]
>>>>> timeToRun ]) ].
>>>>>
>>>>> {
>>>>>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>>>>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>>>>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>>>>
>>>>> It's a bit noisy (i was too lazy to exit other programs), but the one
>>>>> using
>>>>> #directoryTreeDo: seems to be clearly slower than the existing
>>>>> algorithm
>>>>> using BFS.
>>>>>
>>>>>
>>>>> Levente
>>>>>
>>>>>
>>>>>>
>>>>>> =============== Diff against Files-fbs.124 ===============
>>>>>>
>>>>>> Item was changed:
>>>>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>> 'private') -----
>>>>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>> 'enumeration') -----
>>>>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>>>>         "Value oneArgBlock with the path (an OrderedCollection of
>>>>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry
>>>>>> itself."
>>>>>>         self entries do:
>>>>>>                 [ : each |
>>>>>>                 entriesCollection add: each.
>>>>>>                 oneArgBlock value: entriesCollection.
>>>>>>                 each isDirectory ifTrue:
>>>>>>                         [ | subdir |
>>>>>>                         subdir := each asFileDirectory.
>>>>>>                         subdir
>>>>>>                                 directoryTreeDo: oneArgBlock
>>>>>>                                 entries: entriesCollection ].
>>>>>>                 entriesCollection removeLast ]!
>>>>>>
>>>>>> Item was removed:
>>>>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>>>>> 'enumeration') -----
>>>>>> - statsForDirectoryTree: rootedPathName
>>>>>> -       "Return the size statistics for the entire directory tree
>>>>>> starting
>>>>>> at the given root. The result is a three element array of the form:
>>>>>> (<number
>>>>>> of folders><number of files><total bytes in all files>). This method
>>>>>> also
>>>>>> serves as an example of how recursively enumerate a directory tree."
>>>>>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>>>>> -
>>>>>> -       | dirs files bytes todo entries p |
>>>>>> -       dirs := files := bytes := 0.
>>>>>> -       todo := OrderedCollection with: rootedPathName.
>>>>>> -       [todo isEmpty] whileFalse: [
>>>>>> -               p := todo removeFirst.
>>>>>> -               entries := self directoryContentsFor: p.
>>>>>> -               entries do: [:entry |
>>>>>> -                       entry isDirectory
>>>>>> -                               ifTrue: [
>>>>>> -                                       todo addLast: p , self
>>>>>> pathNameDelimiter asString , entry name.
>>>>>> -                                       dirs := dirs + 1]
>>>>>> -                               ifFalse: [
>>>>>> -                                       files := files + 1.
>>>>>> -                                       bytes := bytes + entry
>>>>>> fileSize]]].
>>>>>> -       ^ Array with: dirs with: files with: bytes
>>>>>> - !
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>


More information about the Squeak-dev mailing list