[squeak-dev] The Inbox: Files-cmm.125.mcz
Chris Muller
asqueaker at gmail.com
Wed Jul 3 18:13:15 UTC 2013
Good discussion thanks Levente. You know, after more thinking I had
trouble understanding what is the characteristic of the FilePlugin
primitives that leads to better performance in a BFS?
The only use-case I can think of would be where the doBlock of
directoryTreeDo: could be short-circuited by a non-local return, such
as a detect operation. But in a stats operation, it seems like the
same requests to FilePlugin are made in each case. That is, every
subdirectory gets read just once via the #entries primitive in either
case BFS or DFS..
But if the whole sub-tree is going to be accessed it seems like a
wash, and indeed I'm getting pretty close times between my and your
version of directoryTreeDo::
|dir| dir:= FileDirectory default containingDirectory containingDirectory.
[dir directoryTreeDo: [ : e | ]] timeToRun
(BFS) "444"
(DFS) "428"
I still believe a directoryTreeDetect: operation has potential for
less exercise of FilePlugin under BFS, but did I miss something for a
full tree-traversal?
Thanks.
On Wed, Jul 3, 2013 at 12:48 AM, Levente Uzonyi <leves at elte.hu> wrote:
> On Sun, 30 Jun 2013, Chris Muller wrote:
>
>> Ah, thank you! Now I understand Levente's comments.
>>
>> Levente, that is an excellent point about directoryTreeDo: traversing
>> depth-first. In fact that makes me want to change it to be
>> breadth-first.. I'm trying to think about all the use-cases it serves
>> and whether that would make a difference; since we have the path,
>> maybe not..
>
>
> The only techinal hurdle in changing the algorithm is the argument of
> argument the block. Currently it is an OrderedCollection of DirectoryEntry
> objects from the root to the current DirectoryEntry object. With DFS this is
> naturally available, but with BFS it's not.
>
> So the question is, what should we do?
>
> 1) keep the existing OrderedCollection argument
>
> I uploaded Files-ul.125 to the Inbox, which changes the implementation of
> #directoryTreeDo: this way. As you can see, it's not an easy to understand
> algorithm. :)
>
> 2) use some kind of linked list-like data structure instead.
>
> Here's an example implementation for this:
>
> | queue argument |
> argument := { self directoryEntry. nil }.
> oneArgBlock value: argument.
> queue := OrderedCollection with: argument copy.
> [ queue isEmpty ] whileFalse: [
> | parent |
> parent := queue removeFirst.
> argument at: 2 put: parent.
> parent first asFileDirectory entriesDo: [ :entry |
> argument at: 1 put: entry.
> oneArgBlock value: argument.
> entry isDirectory ifTrue: [ queue addLast: argument
> copy ] ] ]
>
> 3) only pass the current DirectoryEntry to the block and let the user decide
> if he needs the whole path (though it costs a lot to recreate it)
>
> Implementation:
>
> | queue |
> queue := OrderedCollection with: self directoryEntry.
> oneArgBlock value: queue first.
> [ queue isEmpty ] whileFalse: [
> queue removeFirst asFileDirectory entriesDo: [ :entry |
> oneArgBlock value: entry.
> entry isDirectory ifTrue: [ queue addLast: entry ] ]
> ]
>
> Both examples at 2) and 3) require the #entriesDo: method, which is
> available in Files-ul.125 from the Inbox.
>
> If you profile any of this code on a larger directory tree, you'll see that
> there's plenty of room for improvement in FileDirectory.
>
>
> Levente
>
>
>>
>> On Sun, Jun 30, 2013 at 1:29 PM, Frank Shearar <frank.shearar at gmail.com>
>> wrote:
>>>
>>> On 30 June 2013 18:19, Chris Muller <asqueaker at gmail.com> wrote:
>>>>
>>>> Hi Levente, I do not know what you mean by DFS and BFS.
>>>
>>>
>>> Depth-first versus breadth-first search.
>>>
>>> frank
>>>
>>>> My claim that
>>>> statsForDirectoryTree: "does one useless thing, and does it
>>>> inefficiently" is commentary about the _design_, not the
>>>> implementation, because it forces me to enumerate an entire directory
>>>> even if I don't want to. For example, what if I want to exclude a
>>>> ".temp" directory's stats from its containing directory's stats? To
>>>> do that, I'd have to run #statsForDirectoryTree: twice. That's not
>>>> going to be faster..
>>>>
>>>> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <leves at elte.hu> wrote:
>>>>>
>>>>> On Fri, 28 Jun 2013, commits at source.squeak.org wrote:
>>>>>
>>>>>> A new version of Files was added to project The Inbox:
>>>>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>>>>
>>>>>> ==================== Summary ====================
>>>>>>
>>>>>> Name: Files-cmm.125
>>>>>> Author: cmm
>>>>>> Time: 27 June 2013, 7:24:33.37 pm
>>>>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>>>>> Ancestors: Files-fbs.124
>>>>>>
>>>>>> FileDirectory>>statsForDirectoryTree: is a prime example of
>>>>>> FileDirectory
>>>>>> living up to its poor reputation. It does one useless thing, and does
>>>>>> it
>>>>>> inefficiently. It has no senders, deservedly.
>>>>>> Tree-walking operations like gathering stats can be written in
>>>>>> apps more generally using directoryTreeDo:.
>>>>>
>>>>>
>>>>>
>>>>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
>>>>> used
>>>>> by #directoryTreeDo:entries:), which is somewhat better than DFS,
>>>>> because of
>>>>> the way FilePlugin works: you have to iterate over the contents of the
>>>>> whole
>>>>> directory if you want to it to be efficient.
>>>>> If you use DFS, then the contents of all parent directories up to the
>>>>> root
>>>>> will have to be stored in memory during the iteration, while if you use
>>>>> BFS,
>>>>> then only the sibling directories (without their contents) or the
>>>>> siblings's
>>>>> parents will be in memory.
>>>>>
>>>>> I made two "new" implementations of #statsForDirectoryTree:. This one
>>>>> optimizes memory usage for BFS:
>>>>>
>>>>> statsForDirectoryTree2: rootedPathName
>>>>>
>>>>> | dirs files bytes todo |
>>>>>
>>>>> dirs := files := bytes := 0.
>>>>> todo := OrderedCollection with: rootedPathName.
>>>>> [ todo isEmpty ] whileFalse: [
>>>>> | p |
>>>>> p := todo removeFirst.
>>>>> self directoryContentsFor: p do: [ :entry |
>>>>> entry isDirectory
>>>>> ifTrue: [
>>>>>
>>>>> todo addLast: p , self
>>>>> pathNameDelimiter asString , entry name.
>>>>> dirs := dirs + 1]
>>>>> ifFalse: [
>>>>>
>>>>> files := files + 1.
>>>>> bytes := bytes + entry fileSize
>>>>> ] ]
>>>>> ].
>>>>> ^{ dirs. files. bytes }
>>>>>
>>>>> This one uses #directoryTreeDo::
>>>>>
>>>>> statsForDirectoryTree3: rootedPathName
>>>>>
>>>>> | dirs files bytes |
>>>>>
>>>>> dirs := files := bytes := 0.
>>>>> (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>>>> | entry |
>>>>> entry := path last.
>>>>> entry isDirectory
>>>>> ifTrue: [ dirs := dirs + 1 ]
>>>>> ifFalse: [
>>>>>
>>>>> files := files + 1.
>>>>> bytes := bytes + entry fileSize ] ].
>>>>> ^{ dirs. files. bytes }
>>>>>
>>>>> Let's see the numbers:
>>>>>
>>>>> "Make sure the disk won't be accessed"
>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>> "Run the benchmark"
>>>>> #(statsForDirectoryTree: statsForDirectoryTree2:
>>>>> statsForDirectoryTree3:)
>>>>> collect: [ :each |
>>>>> each -> ((1 to: 5) collect: [ :run |
>>>>> Smalltalk garbageCollect.
>>>>> [ FileDirectory default perform: each with: '/usr/' ]
>>>>> timeToRun ]) ].
>>>>>
>>>>> {
>>>>> #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>>>> #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>>>> #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>>>>
>>>>> It's a bit noisy (i was too lazy to exit other programs), but the one
>>>>> using
>>>>> #directoryTreeDo: seems to be clearly slower than the existing
>>>>> algorithm
>>>>> using BFS.
>>>>>
>>>>>
>>>>> Levente
>>>>>
>>>>>
>>>>>>
>>>>>> =============== Diff against Files-fbs.124 ===============
>>>>>>
>>>>>> Item was changed:
>>>>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>> 'private') -----
>>>>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>> 'enumeration') -----
>>>>>> directoryTreeDo: oneArgBlock entries: entriesCollection
>>>>>> "Value oneArgBlock with the path (an OrderedCollection of
>>>>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry
>>>>>> itself."
>>>>>> self entries do:
>>>>>> [ : each |
>>>>>> entriesCollection add: each.
>>>>>> oneArgBlock value: entriesCollection.
>>>>>> each isDirectory ifTrue:
>>>>>> [ | subdir |
>>>>>> subdir := each asFileDirectory.
>>>>>> subdir
>>>>>> directoryTreeDo: oneArgBlock
>>>>>> entries: entriesCollection ].
>>>>>> entriesCollection removeLast ]!
>>>>>>
>>>>>> Item was removed:
>>>>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>>>>> 'enumeration') -----
>>>>>> - statsForDirectoryTree: rootedPathName
>>>>>> - "Return the size statistics for the entire directory tree
>>>>>> starting
>>>>>> at the given root. The result is a three element array of the form:
>>>>>> (<number
>>>>>> of folders><number of files><total bytes in all files>). This method
>>>>>> also
>>>>>> serves as an example of how recursively enumerate a directory tree."
>>>>>> - "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>>>>> -
>>>>>> - | dirs files bytes todo entries p |
>>>>>> - dirs := files := bytes := 0.
>>>>>> - todo := OrderedCollection with: rootedPathName.
>>>>>> - [todo isEmpty] whileFalse: [
>>>>>> - p := todo removeFirst.
>>>>>> - entries := self directoryContentsFor: p.
>>>>>> - entries do: [:entry |
>>>>>> - entry isDirectory
>>>>>> - ifTrue: [
>>>>>> - todo addLast: p , self
>>>>>> pathNameDelimiter asString , entry name.
>>>>>> - dirs := dirs + 1]
>>>>>> - ifFalse: [
>>>>>> - files := files + 1.
>>>>>> - bytes := bytes + entry
>>>>>> fileSize]]].
>>>>>> - ^ Array with: dirs with: files with: bytes
>>>>>> - !
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>
More information about the Squeak-dev
mailing list
|