Let's begin a project this month that will enable us to execute a search on one of the popular search engines and parse the results. We'll use the AltaVista search engine, but we'll code it in such a way that we can easily add others. This will probably take us several columns to accomplish, so this month we'll concentrate on creating the URL that executes the search and look at issues relating to that before we get into writing the networking code to talk to the search engine.
If you fire up your web browser, go to the AltaVista site, and then do a search on, say, "OS/2 Supersite", you will get a response page listing the hits for that string. Look in the active URL text area of your browser and you will see a long URL with a strange appearance:
http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=%22OS%2F2+Supersite%22&search.x=65&search.y=10
Notice all the percent signs, plus signs and the ampersands. Those are special characters in a URL that is pointed at a CGI program (notice the "/cgi-bin/" in the URL). CGI, short for "Common Gateway Interface", is the most common way of running a program on a web server and then displaying the results in the user's web browser. (See the December, 1996 REXX Files column for a discussion about using REXX for CGI programs.)
A URL cannot directly contain non-printable characters like tabs and carriage returns (and certain others), so the way to include them is to use hexadecimal notation. A tab is ASCII (decimal) code 9, and 9 in hexadecimal is 09. A carriage return is ASCII code 13, or 0D in hexadecimal. In a URL, the percent sign is used to signal that the next two digits are the hexadecimal representation of a character. Looking at the URL above, we see that the first percent sign is followed by "22", meaning hexadecimal 22, which is ASCII code 34, or the double quote. The next one is "%2F", ASCII code 47 or the forward slash, as in OS/2. So, the string that was entered in the search field on the AltaVista web page was
"OS/2 Supersite"
and it gets turned into
%22OS%2F2+Supersite%22
As you might suspect, REXX makes it very easy to go back and forth between characters and their hexadecimal codes. The c2x() function returns the hexadecimal code for a given character, and the x2c() function does the reverse. As you can see above, not all characters have to be converted into their hexadecimal codes. The characters that can be left as they are include the digits and the letters of the (English) alphabet:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
So all we have to do is check to see if a particular character is contained in the above set and convert it to a hexadecimal code if it is not. One way to do this is to loop over the characters of a string and use the Pos() function to see if they are contained in the list of valid characters. Here is some code to do this:
/* Convert a string to URL form */
AString='"OS/2 Supersite"'
OkayChars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
NewString=""
Do i = 1 to Length( AString)
  if Pos( SubStr( AString, i, 1), OkayChars) > 0 Then
    NewString = NewString || SubStr( AString, i, 1)
  Else
    NewString = NewString || '%' || c2x( SubStr( AString, i, 1))
End
Say AString "converts to" NewString
But there is still one thing we have to take care of: the space (ASCII 32) gets special treatment. Spaces in a string have to be converted to plus signs (+). To do this, we modify the above routine slightly:
/* Convert a string to URL form and handle spaces properly */
AString='"OS/2 Supersite"'
OkayChars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
NewString=""
Do i = 1 to Length( AString)
  if Pos( SubStr( AString, i, 1), OkayChars) > 0 Then
    NewString = NewString || SubStr( AString, i, 1)
  else Do
    If SubStr(AString,i,1)=" " Then
      NewString=NewString||"+"
    Else
      NewString = NewString || '%' || c2x( SubStr( AString, i, 1))
  end
end
Say AString "converts to" NewString
Now, the more experienced readers will notice that the above routines are not as efficient as they could be. The primary inefficiency relates to the multiple calls to SubStr(). If you are converting small strings, then the multiple calls don't make much difference. But if you are converting large strings, you would be better off calling SubStr() once and storing the result in a variable like this:
/* Convert a string to URL form, calling SubStr() once per character */
AString='"OS/2 Supersite"'
OkayChars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
NewString=""
Do i = 1 to Length( AString)
  Test = SubStr( AString, i, 1)   /* the current character, fetched once */
  if Pos( Test, OkayChars) > 0 Then
    NewString = NewString || Test
  else Do
    If Test=" " Then
      NewString=NewString||"+"
    Else
      NewString = NewString || '%' || c2x( Test)
  end
end
Say AString "converts to" NewString
So now we have a routine that will encode a string into the proper form for a URL. Next month we'll look at how the CGI program on the server end interprets the URL that we send to it and begin writing the communications code so that we can talk to it.
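In the meantime, here is a rough sketch of how the encoding loop might be packaged as a function and used to build the search URL. The names UrlEncode, UrlDecode and BaseURL are made up for this illustration. The fixed query parameters are simply copied from the sample AltaVista URL shown earlier, and the trailing search.x and search.y values are left off on the untested assumption that the server does not need them. UrlDecode uses x2c() to go the other way and assumes its input is well formed (every percent sign is followed by two hexadecimal digits).

```rexx
/* Sketch: wrap the encoding loop in a function and build a search URL */
SearchFor = '"OS/2 Supersite"'

/* Prefix copied from the sample URL; search.x/search.y omitted (assumption) */
BaseURL = 'http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q='

Encoded = UrlEncode(SearchFor)
Say SearchFor 'encodes to' Encoded
Say 'Search URL:' BaseURL || Encoded
Say 'Decoded again:' UrlDecode(Encoded)
Exit

/* UrlEncode (hypothetical name): letters and digits pass through,   */
/* a space becomes "+", anything else becomes "%" plus its hex code  */
UrlEncode: Procedure
  Parse Arg AString
  OkayChars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
  NewString = ''
  Do i = 1 To Length(AString)
    Test = SubStr(AString, i, 1)
    Select
      When Pos(Test, OkayChars) > 0 Then NewString = NewString || Test
      When Test == ' ' Then NewString = NewString || '+'
      Otherwise NewString = NewString || '%' || c2x(Test)
    End
  End
Return NewString

/* UrlDecode (hypothetical name): the reverse, assuming well-formed input */
UrlDecode: Procedure
  Parse Arg AString
  NewString = ''
  i = 1
  Do While i <= Length(AString)
    Test = SubStr(AString, i, 1)
    Select
      When Test == '+' Then NewString = NewString || ' '
      When Test == '%' Then Do
        /* the two characters after "%" are the hex code */
        NewString = NewString || x2c(SubStr(AString, i + 1, 2))
        i = i + 2
      End
      Otherwise NewString = NewString || Test
    End
    i = i + 1
  End
Return NewString
```

Run as written, the first line of output should show the same %22OS%2F2+Supersite%22 string that appears in the sample URL, and the last line should give back the original search string. Keeping the encoder in its own function also fits the goal of adding other search engines later, since only the BaseURL prefix would need to change.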
Dr. Dirk Terrell is an astronomer at the University of Florida specializing in interacting binary stars. His hobbies include cave diving, martial arts, painting and writing OS/2 software such as HTML Wizard.
Copyright © 1998 - Falcon Networking | ISSN 1203-5696