scanf: Here be dragons.
gets: The function that cannot be used safely.
fgets: A partial solution to the get-a-line problem.
fgetword: A word at a time, and no word too long!
fgetline: Read a line as long as your arm - or much, much longer.
related links: Other pages dealing with this subject.
One of the first challenges facing the neophyte C programmer is that of obtaining data, either from
the user, or from a file. It is a matter of some concern (to me, at least) that so many C teachers try
to satisfy the student's need to get this data by introducing the extraordinarily complex and subtle
scanf function and its close relative, fscanf.
scanf function
Here is a typical "student" program that uses the scanf function to read
from the standard input device. THIS CODE IS BUGGY! DO NOT USE IT!
#include <stdio.h>
int main(void)
{
    char *buf;
    scanf("%s", &buf);
    printf("Hello %s", buf);
    return 0;
}
This code has several problems. Firstly, it mistakenly passes the address of the pointer to scanf. When this program is run, the result is garbage. Hardly surprising, really. Let's fix that (BUT THE PROGRAM IS STILL BROKEN!)...
#include <stdio.h>
int main(void)
{
    char *buf;
    scanf("%s", buf);
    printf("Hello %s", buf);
    return 0;
}
This code is still very poor. The programmer has made the rather common mistake of thinking that
char * is the C way of spelling "string" - which is not true. Unfortunately,
it is entirely possible that the program will "work" as the programmer expected it to.
When I compile this on my system and then run it, here is the output I get:
rjh@tux:~/dev/web/eton/c> ./scanf
Richard
Hello Richardrjh@tux:~/dev/web/eton/c>
Alas, it works. And yet it is still broken. About the worst problem the newbie C programmer
might spot with the result is that it fails to put the shell prompt on a new line. He can, of course,
fix that by putting a '\n' character into the printf format string.
Having put that into the code, I compiled it using much stricter diagnostic checking, and here is the output my compiler provided:
rjh@tux:~/dev/web/eton/c> gcc -W -Wall -ansi -pedantic -O2 -o scanf scanf.c
scanf.c: In function `main':
scanf.c:5: warning: `buf' might be used uninitialized in this function
The real problem here is that we are telling scanf to store a string at the
address provided, but we haven't actually allocated any storage in which to store that string.
One easy fix for this is to use an array (BUT THE PROGRAM IS STILL BROKEN!):
#include <stdio.h>
int main(void)
{
    char buf[64];
    scanf("%s", buf);
    printf("Hello %s\n", buf);
    return 0;
}
This is getting a bit better, but there are still some problems. Firstly, we're not checking whether
the scanf call succeeded. That's no big deal if it did succeed, but can be a
very big deal indeed if it didn't.
Secondly, we can't store any word with a length greater than or equal to the size of the array.
Thirdly, we can't guarantee that the user won't try to exceed that limit - and scanf
will do nothing to stop the user from running straight over the end of the array, either accidentally
or maliciously. (In case you were wondering, this is one of the ways you can make your code
vulnerable to a buffer overflow attack.)
This next version of the program fixes two of these problems - the first and the last:
#include <stdio.h>
int main(void)
{
    char buf[64];
    if(1 == scanf("%63s", buf))
    {
        printf("Hello %s\n", buf);
    }
    else
    {
        fprintf(stderr, "Input error.\n");
    }
    return 0;
}
This code is much better. Put in an explanatory comment or two, and I'd award you nine out of ten (if you were in your first week as a C programmer!).
But it still suffers from one problem - what if the input stream contains a word that is longer than the size of the array? We can stop the outsized word from corrupting memory easily enough; that's what the 63 is doing in "%63s". But we're still losing data. What we'd really like is to get the whole word in one fell swoop. Later on in this article, we'll design a solution to this problem.
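To see what "losing data" means here, consider this minimal sketch. It uses sscanf on a string as a stand-in for the input stream, and a deliberately tiny 8-byte buffer so the effect is easy to see; the helper name first_chunk is mine, purely for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: copy the start of the first whitespace-delimited
   word of src into an 8-byte buffer. Returns the number of characters of
   src consumed, or -1 if there was no word to read. */
int first_chunk(const char *src, char out[8])
{
    int used = 0;

    /* %7s stores at most 7 characters plus the terminating '\0';
       %n records how much of the input has been consumed so far. */
    if (sscanf(src, "%7s%n", out, &used) != 1)
    {
        return -1;
    }
    return used;
}
```

Given the word "extraordinarily", the first call stores "extraor" and reports 7 characters consumed. The remaining "dinarily" is still waiting in the input, and the next read will treat it as though it were a separate word.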
If anyone has written an article on the robust use of scanf, in all
its hideous complexity, and would like me to link to that article here, please get in touch.
Actually, there's another problem that I haven't mentioned yet - what if we want to get a whole line,
rather than a single word? Well, the C library provides a function to get an entire line of input from
the standard input device. Unfortunately, this function, gets, is deeply flawed.
gets function
The gets function belongs to a bygone age, when users behaved themselves,
buffer overrun attacks were unheard of, and programmers were less aware of the importance of
robust code. (I'm not sure whether such an age ever existed, but that's another discussion for
another time, if ever.)
gets takes as much data as it can find in the standard input stream, up to either
the end of the stream or a newline character if that is encountered first. It reads and discards the
newline character, and stores all the characters before that in consecutive memory locations,
beginning at the address you supply as an argument. If this turns out to be more characters than
you had memory for, well, that's your tough luck.
It is possible for an unscrupulous user of your code to exploit the gets function
for initiating a buffer overrun attack. I'm not going to go into the details. You can find them
easily enough on the Web. But I will just mention that this is not a theoretical problem. Ever since
the infamous Internet Worm of 1988, malicious programmers have been exploiting programs that
use gets. The lessons are clear: (1) protect your buffer! (2) never use gets
because it makes (1) impossible.
I will not demonstrate the use of the gets function here. Why tempt fate?
fgets function
What we could really do with is a function like gets but which accepts
a parameter that specifies the size of the buffer, and which promises not to write more than that
many characters into your buffer. There is such a function, of course - a standard C library
function named fgets. This function not only accepts a buffer size parameter, but
also a stream parameter - so you can use it for fetching data from any text stream open for input!
This is very useful indeed. (If you just want data from the standard input device, use the standard
input stream pointer stdin as the third argument to fgets.)
What happens if fgets encounters a line that is longer than the buffer
we provide? Well, it's very simple - the function stops reading bytes from the input stream as soon
as the buffer is full. At that point, it writes a '\0' character as the last character in
the buffer (no, it's all right, it knows to read only n - 1 bytes of data, to leave room
for the '\0').
Can we detect whether a complete line was read? Yes. If the character just before the null
terminating character of the populated buffer is not a '\n' character, then
we know that the line is incomplete (or that the stream ended without a final newline, which
we can confirm by checking for end-of-file). To get the rest of the line, we can call fgets
again when we're ready for the rest of the data. (Yes, I agree that this isn't exactly satisfactory.
Have patience, and we'll get there in the end.)
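Here is a minimal sketch of that newline test. The helper name read_piece and its return codes are my own invention for illustration; only fgets and strlen come from the standard library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one piece of a line into buf.
   Returns  1 if this piece completes the line (the newline is removed),
            0 if more of the line may follow (the buffer filled up, or
              the stream ended without a final newline),
           -1 if nothing could be read (end of file or read error). */
int read_piece(char *buf, int size, FILE *fp)
{
    size_t len;

    if (fgets(buf, size, fp) == NULL)
    {
        return -1;
    }
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
    {
        buf[len - 1] = '\0';
        return 1;
    }
    return 0;
}
```

With a 5-byte buffer and the input line "abcdefghij", three calls yield "abcd", "efgh" and "ij", returning 0, 0 and 1 respectively. The caller has to stitch the pieces back together, which is exactly the chore we would like to avoid.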
If we know in advance the longest line we expect to encounter, we can use fgets
for very easy line-at-a-time processing. But there's a hitch. Typically, we don't want the
'\n' character on the end of the line. Having said that, we do want to know that it's
there (to assure ourselves that a complete line has been read). This leads to some fairly boiler-plate
C code - some "spot the newline" detection code, and a little helper function to replace
a '\n' character with a '\0' character.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXLINE 128
/* Replace the first '\n' in s with '\0'; return 1 if one was found, else 0. */
int chomp(char *s)
{
    int chomped = 0;
    char *p = strchr(s, '\n');
    if(p != NULL)
    {
        *p = '\0';
        chomped = 1;
    }
    return chomped;
}
int main(void)
{
    char buf[MAXLINE] = {0};
    int rc = 0;
    /* Stop at end of input, or as soon as a line fails to fit in buf. */
    while(0 == rc && fgets(buf, sizeof buf, stdin) != NULL)
    {
        if(chomp(buf))
        {
            printf("Got the line [%s]\n", buf);
        }
        else
        {
            printf("Line too long! Aborting.\n");
            rc = EXIT_FAILURE;
        }
    }
    return rc;
}
As you can see, this example program quits early if it encounters data it can't handle. That's not really good enough for me, and I don't suppose it's good enough for you, either. So - what are we going to do about it? More to the point, why isn't there already a function to get a complete line from an input stream, irrespective of its length, in a memory-safe way?
Let's deal with the second point first (I only do this to annoy!). To get a complete line without knowing its length in advance, we have to find a way to get a buffer large enough, without going to the extent of specifying a fixed-size buffer of ludicrous proportions just on the off-chance that we might meet a very long line.
To achieve this, we can make use of dynamic memory allocation. It is certainly possible for a data acquisition function to resize a dynamic buffer as it goes along, always making sure that the buffer is large enough to accommodate the data it reads from the input stream.
Once we make up our minds to do this, however, we have to consider the function's interface with the caller. Do we simply return a pointer to a freshly-allocated buffer? This is certainly very tempting, but what if we call the function in a loop (which, typically, we will want to do)? To prevent memory leakage, the user would have to either copy the pointer safely away or release the memory before calling the function again.
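To make the first option concrete, here is a minimal sketch of a function that grows its buffer as it reads and hands back a fresh allocation every time. It is not the fgetline presented later in this article; the name read_line_alloc, the starting capacity of 16 and the doubling policy are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch only: read one line of any length from fp into a freshly
   allocated buffer, without the newline. Returns NULL at end of file
   with nothing read, or if memory runs out. The caller must free()
   the result. (Overflow of the capacity arithmetic is not checked.) */
char *read_line_alloc(FILE *fp)
{
    size_t cap = 16;          /* current capacity; starting size is arbitrary */
    size_t len = 0;           /* characters stored so far */
    char *buf = malloc(cap);
    int ch;

    if (buf == NULL)
    {
        return NULL;
    }
    while ((ch = getc(fp)) != EOF && ch != '\n')
    {
        if (len + 1 >= cap)   /* always keep room for the '\0' */
        {
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL)
            {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[len++] = (char)ch;
    }
    if (ch == EOF && len == 0)
    {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

Called in a loop, every non-NULL result has to be passed to free (or stored somewhere) before the next call, which is precisely the burden on the caller described above.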
Another possibility is for the function to maintain a buffer internally; this way, the calling code wouldn't have to worry about memory management - but on the other hand, how would we free the buffer when the program is done with data acquisition? A special parameter? Maybe.
A third possibility is to pass the address of a pointer to a reallocable buffer into the function. This is quite a nice idea, because it means the function can re-use the buffer on consecutive calls, but it does mean that we need to keep track of the buffer size.
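A minimal sketch of this third interface might look like the following. Again, this is not the author's fgetline; the names read_line_into and ensure, the return codes and the growth policy are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Grow *line (doubling from a minimum of 16 bytes) until it can hold
   at least `needed` bytes. Returns 1 on success, 0 if memory runs out,
   in which case *line and *size are left untouched. */
static int ensure(char **line, size_t *size, size_t needed)
{
    if (needed > *size)
    {
        size_t newsize = (*size == 0) ? 16 : *size;
        char *tmp;

        while (newsize < needed)
        {
            newsize *= 2;
        }
        tmp = realloc(*line, newsize);   /* realloc(NULL, n) acts like malloc(n) */
        if (tmp == NULL)
        {
            return 0;
        }
        *line = tmp;
        *size = newsize;
    }
    return 1;
}

/* Sketch only: *line must be NULL (with *size == 0) or a pointer obtained
   from malloc/realloc, with *size holding its capacity in bytes.
   Returns 0 on success, 1 at end of file with nothing read,
   and 2 if memory runs out. */
int read_line_into(char **line, size_t *size, FILE *fp)
{
    size_t len = 0;
    int ch;

    while ((ch = getc(fp)) != EOF && ch != '\n')
    {
        if (!ensure(line, size, len + 2))   /* this character plus the '\0' */
        {
            return 2;
        }
        (*line)[len++] = (char)ch;
    }
    if (ch == EOF && len == 0)
    {
        return 1;
    }
    if (!ensure(line, size, len + 1))       /* covers the empty-line case */
    {
        return 2;
    }
    (*line)[len] = '\0';
    return 0;
}
```

The caller starts with a NULL pointer and a size of zero, calls the function in a loop, and calls free exactly once at the end. The price is the extra size_t that must travel alongside the pointer.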
All these design decisions have a certain amount of merit, and there's no single, obvious, right answer. And that is why (or at least, I hope it's why) the standard C library doesn't include a function of this kind; whichever interface they chose, there would definitely be some people who thought it was the wrong choice! Also, of course, it's perfectly possible to implement a function of this kind using existing ISO C functions, so it's not unreasonable to leave such design choices to the individual programmer.
I have written two functions of this kind - one for reading a complete word (however large) from an input stream, and another for reading a complete line, again of arbitrary size, from the stream.
Let's look, first of all, at the function for getting a complete word at a time:
fgetword function
Here's the prototype for fgetword:
int fgetword(char **word,
             size_t *size,
             const char *delimiters,
             size_t maxrecsize,
             FILE *fp,
             unsigned int flags);
The source is in fgetword.c - you will also need
fgetdata.h, which you should place in your include path.
The fgetword function reads a word at a time. A word is defined as a sequence
of characters that does not include any of the characters in the delimiters argument you
supply to the function.
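That definition of a word can be illustrated on an in-memory string with the standard strspn and strcspn functions. This sketch is not taken from fgetword.c; the helper name next_word is hypothetical, and the choice to skip leading delimiters is my assumption rather than a statement about how fgetword is implemented.

```c
#include <string.h>

/* Hypothetical helper: find the next word in s, where a word is a run of
   characters containing none of the characters in delimiters. Stores the
   word's length in *len and returns a pointer to its first character, or
   NULL if only delimiters (or nothing at all) remain. */
const char *next_word(const char *s, const char *delimiters, size_t *len)
{
    s += strspn(s, delimiters);       /* skip any leading delimiters */
    *len = strcspn(s, delimiters);    /* measure the run of non-delimiters */
    return (*len > 0) ? s : NULL;
}
```

With the delimiters " ,!" and the text "Hello, world!", successive calls find "Hello" and then "world", after which the function returns NULL.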
The function uses a reallocable buffer. You can get yourself a buffer and pass it in if you wish, but
there is no need, since fgetword is perfectly capable of providing one for you. To take
advantage of this, you need only pass in the address of a char * whose value is
NULL. If you do decide to use a buffer you allocated yourself, you must know
how big that buffer is, and you must tell fgetword. You do this by populating a
size_t object with the exact capacity of the buffer and passing its address in as the
second argument.
Here is a fairly typical way to use fgetword:
#include <stdio.h>
#include <stdlib.h>
#include "fgetdata.h"
int main(void)
{
    const char *delimiters = " \t\r\n\f\v\a\b\\?\\\'\"!%^&*()=+/<>,.|[]{}#~";
    char *line = NULL;
    size_t size = 0;
    while(0 == fgetword(&line,
                        &size,
                        delimiters,
                        (size_t)-1,
                        stdin,
                        0))
    {
        printf("Word found: [%s]\n", line);
    }
    free(line);
    return 0;
}
As you can see, it is necessary to pass the address of the buffer pointer, because the
fgetword function can (and sometimes does) need to change the location of the buffer.
This is why it is essential that you don't use an auto or static
buffer: the pointer must either be NULL or refer to memory that can legally be resized.
Note that the size information you give to fgetword is updated within the routine, so
you can find out how much memory is tied up in the buffer. If you think it's too much, by all means
reduce it yourself using realloc or, alternatively, pass the FGDATA_REDUCE
flag as the last argument to the function. This will cause fgetword to reduce the buffer size
to the minimum necessary to handle the current word. Note that you have absolute control over the buffer
size, via the fourth parameter. If you don't want to limit the buffer size, set this to (size_t)-1.
If you want the buffer size to be limited, this parameter is your chance to be strict. :-)
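If you choose to trim the buffer yourself with realloc, something along these lines would do. The helper name shrink_to_fit is hypothetical, and I am not claiming that FGDATA_REDUCE works in exactly this way; this is simply the do-it-yourself alternative mentioned above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: shrink a malloc'd buffer so that it is just large
   enough for the string it currently holds. On success, updates *buf and
   *size and returns 1. On failure, leaves both untouched and returns 0. */
int shrink_to_fit(char **buf, size_t *size)
{
    size_t needed = strlen(*buf) + 1;   /* the text plus its '\0' */
    char *tmp;

    if (needed >= *size)
    {
        return 1;                       /* nothing to trim */
    }
    tmp = realloc(*buf, needed);
    if (tmp == NULL)
    {
        return 0;                       /* the original buffer is still valid */
    }
    *buf = tmp;
    *size = needed;
    return 1;
}
```

Remember to keep the size object in step with the buffer, as this helper does, because the next call to fgetword relies on it.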
fgetline function
Here's the prototype for fgetline:
int fgetline(char **line,
             size_t *size,
             size_t maxrecsize,
             FILE *fp,
             unsigned int flags);
The source is in fgetline.c - you will also need
fgetdata.h, which you should place in your include path.
The fgetline function reads a line at a time. It is effectively equivalent in most respects
to fgetword(line, size, "\n", maxrecsize, fp, flags); - the only difference being that
whereas fgetword treats an empty string as equivalent to end-of-file (think about it!), the
fgetline function will (correctly) retain blank lines. This means it may be necessary to
test line[0] against '\0' before processing a line, depending on the needs
of your application.
related links
Chuck Falconer's ggets function.
Morris Dovey's getsm function.
Strmsrfr's fgetanyline function.
That's it for now. Suggestions for improvements to this page are most welcome. I would like to offer my thanks to the authors of the bluefish HTML editor, which I used for composing this article.