By Pablo Varando
Expert Author
Article Date: 2003-12-31
This tutorial will show you how to create a local file crawler that finds a specified document type (e.g. PDF files) within a directory and its child directories.
I want to begin by explaining a little bit about what a crawler is, since some of you might be thinking... a what? :)
A crawler is a script that walks through a set of locations and returns the items matching what you told it to find. I think the best explanation is the actual code itself, so let's get started:
The first example will be a local file crawler. Say you have a directory structure that looks like this:
D:\websites\information.pdf
D:\websites\account_info.pdf
D:\websites\mysite.com\info.pdf
D:\websites\hello kitty\free_stuff.pdf
Notice that the PDF files sit in different folders, all somewhere under the D:\websites folder, so that folder will become the ROOT FOLDER.
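The crawler keeps its queue of folders in a pipe-delimited ColdFusion list, so it may help to see those list functions on their own first. This is only a small sketch: the variable name folders_to_crawl is made up for illustration, and the two folder names are borrowed from the example structure above.

```cfml
<!--- Illustration only: folders_to_crawl is a made-up variable name --->
<cfset folders_to_crawl = "">
<!--- ListAppend returns a new list; "|" is the delimiter because it cannot appear in a Windows path --->
<cfset folders_to_crawl = ListAppend(folders_to_crawl, "D:\websites\mysite.com\", "|")>
<cfset folders_to_crawl = ListAppend(folders_to_crawl, "D:\websites\hello kitty\", "|")>

<cfoutput>
<!--- ListLen counts the entries; ListFind returns the position of an exact match, or 0 if there is none --->
Folders queued: #ListLen(folders_to_crawl, "|")#<br>
Position of mysite.com: #ListFind(folders_to_crawl, "D:\websites\mysite.com\", "|")#<br>
<!--- loop over the list one folder at a time --->
<cfloop list="#folders_to_crawl#" index="folder" delimiters="|">
#folder#<br>
</cfloop>
</cfoutput>
```

Passing the "|" delimiter to every list function matters here, because the default comma delimiter would split any folder name that contains a comma.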
<!--- define an empty variable that will become a list of directories
to search later in the application --->
<cfset current_directory_to_crawl = "">
<!--- now by default define the root folder to search, in this example D:\websites\
(keep the trailing backslash, because file paths are built below as dir & name) --->
<cfset next_directory_to_crawl = "D:\websites\">
<!--- Now define a variable that tells the application later on whether it should keep crawling.
By default set the value to 1 --->
<cfset crawl_again = 1>
<!--- now define a variable that will count the number of files found and set it to 'zero' by default --->
<cfset file_counter = 0>
<!--- do >>ONLY<< one extension per run --->
<cfset extension_to_crawl = "pdf">
<!--- define a variable to hold the file names of the files found --->
<cfset file_container = "">
<!--- create a container to hold all files processed (useful if you want to move them elsewhere) --->
<cfset file_completed = "">
<!--- ok, begin the processing here; the loop runs while
crawl_again is 1 and stops when it is set to 0 --->
<cfloop condition="crawl_again neq 0">
<!--- first switch the directory values --->
<cfset current_directory_to_crawl = next_directory_to_crawl>
<!--- now clear the next --->
<cfset next_directory_to_crawl = "">
<!--- Clear the file container --->
<cfset file_container = "">
<!--- Now loop through the list of directories to crawl and look for the extensions --->
<cfloop list="#current_directory_to_crawl#" index="dir" delimiters="|">
<!---- now list the directory contents --->
<cfdirectory action="LIST"
directory="#dir#"
name="CurrentPull">
<!--- first get all the files --->
<cfloop query="CurrentPull">
<!--- process everything returned by the CFDIRECTORY, with the exception of the
records named "." and "..". Those can be skipped for this example.
Note the AND: with OR the test is always true and nothing gets skipped --->
<cfif name neq "." AND name neq "..">
<!--- display the current file/directory to the screen --->
<cfoutput>#name#<BR></cfoutput>
<!--- lets see if the current item is a file or directory --->
<cfif type eq "dir">
<!--- Found a directory, set this folder as crawlable
so on the next loop we can search it for PDF files --->
<cfset next_directory_to_crawl =
ListAppend(next_directory_to_crawl, dir & name & "\", "|")>
<cfelseif type eq "file">
<!--- this is a file, see if the extension of the file is the one defined above --->
<cfif ListLast(name, ".") eq extension_to_crawl>
<!--- here it checks to make sure that this file and its path are UNIQUE --->
<cfif NOT ListFind(file_completed, dir & name, "|")>
<!--- mark this file as completed --->
<cfset file_completed = ListAppend(file_completed, dir & name, "|")>
<!--- add the file to the container --->
<cfset file_container = ListAppend(file_container, dir & name, "|")>
<!--- add one to the file counter --->
<cfset file_counter = file_counter + 1>
</cfif>
</cfif>
</cfif>
</cfif>
</cfloop>
</cfloop>
<!--- now output the final values to the screen so we can see them --->
<cfoutput>
<hr><ol>
<cfloop list="#next_directory_to_crawl#" index="folder" delimiters="|">
<li>#folder#</li>
</cfloop>
</ol>
<hr><ol>
<cfloop list="#file_container#" index="files" delimiters="|">
<li>#files#</li>
</cfloop>
</ol>
<HR>Files Found: #file_counter#<hr>
</cfoutput>
<cfif next_directory_to_crawl eq "">
<!--- There are no more folders to crawl, stop the main loop --->
<cfset crawl_again = 0>
</cfif>
</cfloop>
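The file_completed list was set up with moving files in mind, so here is one way it could be used once the crawl has finished. Treat it as a rough sketch: the destination folder D:\pdf_archive\ and the variable names destination_folder and found_file are assumptions of mine, not part of the crawler above.

```cfml
<!--- Hypothetical destination folder; it must already exist on the server --->
<cfset destination_folder = "D:\pdf_archive\">
<cfloop list="#file_completed#" index="found_file" delimiters="|">
    <!--- skip anything that has disappeared since the crawl ran --->
    <cfif FileExists(found_file)>
        <!--- copy the file and keep its original name --->
        <cffile action="copy"
            source="#found_file#"
            destination="#destination_folder#">
    </cfif>
</cfloop>
```

Because every file lands in the same folder, two files with the same name from different folders would collide, so you may want to rename them on the way in.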
That's pretty much it! You now have a local crawler that finds every file of a given type under a root folder, and the same loop can be extended to do much more.
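If your server runs a newer ColdFusion release, CFDIRECTORY can do the walking for you. The sketch below assumes ColdFusion MX 7 or later, where the recurse attribute is available alongside filter. It is a different technique from the manual loop above, not a drop-in copy of it, and the names AllPdfs and pdf_count are made up for illustration.

```cfml
<!--- Assumes ColdFusion MX 7 or later, where CFDIRECTORY accepts recurse --->
<cfdirectory action="list"
    directory="D:\websites\"
    name="AllPdfs"
    filter="*.pdf"
    recurse="yes">

<!--- count only real files, in case a folder name also matches the filter --->
<cfset pdf_count = 0>
<cfoutput>
<ol>
<cfloop query="AllPdfs">
    <cfif AllPdfs.type eq "file">
        <cfset pdf_count = pdf_count + 1>
        <!--- the directory column holds the folder each file was found in --->
        <li>#AllPdfs.name# (in #AllPdfs.directory#)</li>
    </cfif>
</cfloop>
</ol>
<hr>Files Found: #pdf_count#<hr>
</cfoutput>
```

The manual crawler is still worth keeping around for older servers, or when you want to act on each folder as it is discovered.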