C Programming - Getting Data from an Input Stream

Richard Heathfield

Last updated: 31 October 2003


Jump Table
Introduction
scanf - Here be dragons.
gets - The function that cannot be used safely.
fgets - A partial solution to the get-a-line problem.
fgetword - A word at a time, and no word too long!
fgetline - Read a line as long as your arm - or much, much longer.
related links - Other pages dealing with this subject.

Introduction

One of the first challenges facing the neophyte C programmer is that of obtaining data, either from the user, or from a file. It is a matter of some concern (to me, at least) that so many C teachers try to satisfy the student's need to get this data by introducing the extraordinarily complex and subtle scanf function and its close relative, fscanf.

The scanf function

Here is a typical "student" program that uses the scanf function to read from the standard input device. THIS CODE IS BUGGY! DO NOT USE IT!

    
#include <stdio.h>

int main(void)
{
  char *buf;
  scanf("%s", &buf);
  printf("Hello %s", buf);
  return 0;
}
    
    

This code has several problems. Firstly, it mistakenly passes the address of the pointer to scanf: the %s conversion expects a char *, but &buf is a char **. The behaviour is undefined, and when this program is run, the result is garbage. Hardly surprising, really. Let's fix that (BUT THE PROGRAM IS STILL BROKEN!)...

    
#include <stdio.h>

int main(void)
{
  char *buf;
  scanf("%s", buf);
  printf("Hello %s", buf);
  return 0;
}
    
    

This code is still very poor. The programmer has made the rather common mistake of thinking that char * is the C way of spelling "string" - which is not true. Unfortunately, it is entirely possible that the program will "work" as the programmer expected it to.

When I compile this on my system and then run it, here is the output I get:


rjh@tux:~/dev/web/eton/c> ./scanf 
Richard
Hello Richardrjh@tux:~/dev/web/eton/c>

Alas, it works. And yet it is still broken. About the worst problem the newbie C programmer might spot with the result is that it fails to put the shell prompt on a new line. He can, of course, fix that by putting a '\n' character into the printf format string.

Having put that into the code, I compiled it using much stricter diagnostic checking, and here is the output my compiler provided:


rjh@tux:~/dev/web/eton/c> gcc -W -Wall -ansi -pedantic -O2 -o scanf scanf.c
scanf.c: In function `main':
scanf.c:5: warning: `buf' might be used uninitialized in this function

The real problem here is that we are telling scanf to store a string at the address provided, but we haven't actually allocated any storage in which to store that string. One easy fix for this is to use an array (BUT THE PROGRAM IS STILL BROKEN!):


#include <stdio.h>

int main(void)
{
  char buf[64];
  scanf("%s", buf);
  printf("Hello %s\n", buf);
  return 0;
}

This is getting a bit better, but there are still some problems. Firstly, we're not checking whether the scanf call succeeded. That's no big deal if it did succeed, but can be a very big deal indeed if it didn't.

Secondly, we can't store any word with a length greater than or equal to the size of the array.

Thirdly, we can't guarantee that the user won't try to exceed that limit - and scanf will do nothing to stop the user from running straight over the end of the array, either accidentally or maliciously. (In case you were wondering, this is one of the ways you can make your code vulnerable to a buffer overflow attack.)

This next version of the program fixes two of these problems - the first and the last:


#include <stdio.h>

int main(void)
{
  char buf[64];
  if(1 == scanf("%63s", buf))
  {
    printf("Hello %s\n", buf);
  }
  else
  {
    fprintf(stderr, "Input error.\n");
  }
  return 0;
}


This code is much better. Put in an explanatory comment or two, and I'd award you nine out of ten (if you were in your first week as a C programmer!).

But it still suffers from one problem - what if the input stream contains a word that is longer than the size of the array? We can stop the outsized word from corrupting memory easily enough; that's what the 63 is doing in "%63s". But we're still losing data. What we'd really like is to get the whole word in one fell swoop. Later on in this article, we'll design a solution to this problem.
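To see that data loss in action, here is a minimal sketch. It uses sscanf on a string rather than scanf on the standard input device, purely so that the behaviour is easy to observe, and it uses a deliberately tiny field width of 7. The helper name read_short_word is my own invention for this illustration, not a library function.

```c
#include <stdio.h>

/* Hypothetical helper, for illustration only.
 * Reads one word of at most 7 characters from the string src into out,
 * which must have room for at least 8 bytes (7 characters plus '\0').
 * Returns the number of characters of src that were consumed
 * (including any leading whitespace that %s skipped), or -1 if no
 * word could be read. */
int read_short_word(const char *src, char *out)
{
  int consumed = 0;

  /* %7s stores at most 7 characters; %n records how far we got.
   * %n does not count towards sscanf's return value. */
  if(1 != sscanf(src, "%7s%n", out, &consumed))
  {
    return -1;
  }
  return consumed;
}
```

Given the word "extraordinarily", the first call stores only "extraor" and reports 7 characters consumed. The memory is safe, but the caller has to come back for "dinaril" and then "y" and glue the pieces together by hand.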

If anyone has written an article on the robust use of scanf, in all its hideous complexity, and would like me to link to that article here, please get in touch.

Actually, there's another problem that I haven't mentioned yet - what if we want to get a whole line, rather than a single word? Well, the C library provides a function to get an entire line of input from the standard input device. Unfortunately, this function, gets, is deeply flawed.


The gets function


The gets function belongs to a bygone age, when users behaved themselves, buffer overrun attacks were unheard of, and programmers were less aware of the importance of robust code. (I'm not sure whether such an age ever existed, but that's another discussion for another time, if ever.)

gets takes as much data as it can find in the standard input stream, up to either the end of the stream or a newline character if that is encountered first. It reads and discards the newline character, and stores all the characters before that in consecutive memory locations, beginning at the address you supply as an argument. If this turns out to be more characters than you had memory for, well, that's your tough luck.

It is possible for an unscrupulous user of your code to exploit the gets function for initiating a buffer overrun attack. I'm not going to go into the details. You can find them easily enough on the Web. But I will just mention that this is not a theoretical problem. Ever since the infamous Internet Worm of 1988, malicious programmers have been exploiting programs that use gets. The lessons are clear: (1) protect your buffer! (2) never use gets because it makes (1) impossible.

I will not demonstrate the use of the gets function here. Why tempt fate?


The fgets function


What we could really do with is a function like gets but which accepts a parameter that specifies the size of the buffer, and which promises not to write more than that many characters into your buffer. There is such a function, of course - a standard C library function named fgets. This function not only accepts a buffer size parameter, but also a stream parameter - so you can use it for fetching data from any text stream open for input! This is very useful indeed. (If you just want data from the standard input device, use the standard input stream pointer stdin as the third argument to fgets.)

What happens if fgets encounters a line that is longer than the buffer we provide? Well, it's very simple - the function stops reading bytes from the input stream as soon as the buffer is full. At that point, it writes a '\0' character as the last character in the buffer (no, it's all right, it knows to read only n - 1 bytes of data, to leave room for the '\0').

Can we detect whether a complete line was read? Yes. If the character just before the null terminating character of the populated buffer is not a '\n' character, then we know that the line is incomplete (or that we have hit the end of a stream whose final line has no newline on it). To get the rest of the line, we can call fgets again when we're ready for the rest of the data. (Yes, I agree that this isn't exactly satisfactory. Have patience, and we'll get there in the end.)
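Here is a minimal sketch of that "call fgets again" idea. It measures the length of the next line by reading it in small pieces and checking each piece for a trailing newline. The function name line_length and the tiny 8-byte buffer are assumptions made purely for illustration. A real program would use a much bigger buffer and would probably want to keep the data, not just count it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Returns the length of the next line in fp (not counting the newline),
 * or -1 if there is nothing left to read. */
long line_length(FILE *fp)
{
  char piece[8];        /* deliberately tiny, to force several calls */
  long total = 0;
  int got_any = 0;

  while(fgets(piece, sizeof piece, fp) != NULL)
  {
    size_t len = strlen(piece);
    got_any = 1;
    if(len > 0 && piece[len - 1] == '\n')
    {
      /* Newline seen, so the line is now complete. */
      return total + (long)(len - 1);
    }
    /* No newline yet: this was only part of the line. Keep going. */
    total += (long)len;
  }

  /* fgets returned NULL: end of file (or an error). A final line
   * with no newline on the end still counts as a line. */
  return got_any ? total : -1;
}
```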

If we know in advance the longest line we expect to encounter, we can use fgets for very easy line-at-a-time processing. But there's a hitch. Typically, we don't want the '\n' character on the end of the line. Having said that, we do want to know that it's there (to assure ourselves that a complete line has been read). This leads to some fairly boiler-plate C code - some "spot the newline" detection code, and a little helper function to replace a '\n' character with a '\0' character.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 128

int chomp(char *s)
{
  int chomped = 0;

  char *p = strchr(s, '\n');
  if(p != NULL)
  {
    *p = '\0';
    chomped = 1;
  }
  return chomped;
}

int main(void)
{
  char buf[MAXLINE] = {0};
  int rc = 0;
  while(0 == rc && fgets(buf, sizeof buf, stdin) != NULL)
  {
    if(chomp(buf))
    {
      printf("Got the line [%s]\n", buf);
    }
    else
    {
      printf("Line too long! Aborting.\n");
      rc = EXIT_FAILURE;
    }
  }
  return rc;
}

As you can see, this example program quits early if it encounters data it can't handle. That's not really good enough for me, and I don't suppose it's good enough for you, either. So - what are we going to do about it? More to the point, why isn't there already a function to get a complete line from an input stream, irrespective of its length, in a memory-safe way?

Let's deal with the second point first (I only do this to annoy!). To get a complete line without knowing its length in advance, we have to find a way to get a buffer large enough, without going to the extent of specifying a fixed-size buffer of ludicrous proportions just on the off-chance that we might meet a very long line.

To achieve this, we can make use of dynamic memory allocation. It is certainly possible for a data acquisition function to resize a dynamic buffer as it goes along, always making sure that the buffer is large enough to accommodate the data it reads from the input stream.
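As a rough sketch of the resizing idea (and nothing more than that), the following function reads one character at a time and doubles the buffer whenever it is about to run out of room. The name read_line_alloc, the starting capacity of 16 and the doubling policy are all my assumptions for this illustration. This is not the code of the functions presented later in this article.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, for illustration only.
 * Reads one line of any length from fp into freshly allocated memory.
 * The newline is consumed but not stored.
 * Returns NULL at end of file, or if memory runs out.
 * The caller must free() the result. */
char *read_line_alloc(FILE *fp)
{
  size_t cap = 16;      /* current capacity of the buffer */
  size_t len = 0;       /* characters stored so far */
  char *buf;
  int ch = getc(fp);

  if(ch == EOF)
  {
    return NULL;        /* nothing left to read */
  }
  buf = malloc(cap);
  if(buf == NULL)
  {
    return NULL;
  }

  while(ch != EOF && ch != '\n')
  {
    if(len + 1 >= cap)  /* always keep one byte spare for the '\0' */
    {
      char *tmp = realloc(buf, cap * 2);
      if(tmp == NULL)
      {
        free(buf);
        return NULL;
      }
      buf = tmp;
      cap *= 2;
    }
    buf[len++] = (char)ch;
    ch = getc(fp);
  }
  buf[len] = '\0';
  return buf;
}
```

Note that the result of realloc goes into a temporary pointer first. Assigning it straight to buf would leak the old buffer if the reallocation failed.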

Once we make up our minds to do this, however, we have to consider the function's interface with the caller. Do we simply return a pointer to a freshly-allocated buffer? This is certainly very tempting, but what if we call the function in a loop (which, typically, we will want to do)? To prevent memory leakage, the user would have to either copy the pointer safely away or release the memory before calling the function again.

Another possibility is for the function to maintain a buffer internally; this way, the calling code wouldn't have to worry about memory management - but on the other hand, how would we free the buffer when the program is done with data acquisition? A special parameter? Maybe.

A third possibility is to pass the address of a pointer to a reallocable buffer into the function. This is quite a nice idea, because it means the function can re-use the buffer on consecutive calls, but it does mean that we need to keep track of the buffer size.
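To make that third possibility concrete, here is a minimal sketch of such an interface. I stress that read_line_into is a hypothetical function written for this illustration, with return codes, a starting size and a growth policy that I have simply assumed. The caller owns both the pointer and the size, and the function updates them whenever it has to grow the buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch, for illustration only.
 * *buf must be NULL (with *size == 0) or point to memory obtained from
 * malloc/realloc that is exactly *size bytes long.
 * Returns 0 on success, 1 at end of file, 2 if memory ran out. */
int read_line_into(char **buf, size_t *size, FILE *fp)
{
  size_t len = 0;
  int ch = getc(fp);

  if(ch == EOF)
  {
    return 1;
  }

  for(;;)
  {
    if(len + 1 >= *size)  /* room for this character plus the '\0'? */
    {
      size_t newsize = (*size == 0) ? 16 : *size * 2;
      char *tmp = realloc(*buf, newsize);
      if(tmp == NULL)
      {
        return 2;         /* the caller's old buffer is still valid */
      }
      *buf = tmp;         /* the buffer may have moved */
      *size = newsize;
    }
    if(ch == EOF || ch == '\n')
    {
      break;
    }
    (*buf)[len++] = (char)ch;
    ch = getc(fp);
  }
  (*buf)[len] = '\0';
  return 0;
}
```

A caller would start with a char * set to NULL and a size_t set to 0, call the function in a loop for as long as it returns 0, and free the buffer once, after the loop. The same memory is re-used from one line to the next, which is exactly the attraction of this design.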

All these design decisions have a certain amount of merit, and there's no single, obvious, right answer. And that is why (or at least, I hope it's why) the standard C library doesn't include a function of this kind; whichever interface they chose, there would definitely be some people who thought it was the wrong choice! Also, of course, it's perfectly possible to implement a function of this kind using existing ISO C functions, so it's not unreasonable to leave such design choices to the individual programmer.

I have written two functions of this kind - one for reading a complete word (however large) from an input stream, and another for reading a complete line, again of arbitrary size, from the stream.

Let's look, first of all, at the function for getting a complete word at a time:

The fgetword function


Here's the prototype for fgetword:


int fgetword(char **word,
             size_t *size,
             const char *delimiters,
             size_t maxrecsize,
             FILE *fp,
             unsigned int flags);

The source code is in fgetword.c - you will also need fgetdata.h, which you should place in your include path.

The fgetword function reads a word at a time. A word is defined as a sequence of characters that does not include any of the characters in the delimiters argument you supply to the function.
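That definition of a word can be illustrated on an ordinary string, using the standard strspn and strcspn functions. This is only a sketch of the idea, not how fgetword itself is implemented. The helper name next_word is hypothetical, and the decision to skip leading delimiters is an assumption of this sketch.

```c
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Skips any leading delimiter characters in s, then measures the run
 * of non-delimiter characters that follows. Stores that length in *len
 * and returns a pointer to the start of the word, or NULL if no word
 * remains in the string. */
const char *next_word(const char *s, const char *delimiters, size_t *len)
{
  s += strspn(s, delimiters);      /* step over delimiters */
  *len = strcspn(s, delimiters);   /* length of the word itself */
  return (*len > 0) ? s : NULL;
}
```

With the delimiters " ,!()" and the text "Hello, world! (again)", repeated calls (each starting just past the previous word) find "Hello", "world" and "again", and then return NULL.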

The function uses a reallocable buffer. You can get yourself a buffer and pass it in if you wish, but there is no need, since fgetword is perfectly capable of providing one for you. To take advantage of this, you need only pass in the address of a char * whose value is NULL. If you do decide to use a buffer you allocated yourself, it must come from malloc, calloc or realloc (because fgetword may need to resize it), you must know how big that buffer is, and you must tell fgetword. You do this by populating a size_t object with the exact capacity of the buffer and passing its address in as the second argument.

Here is a fairly typical way to use fgetword:

    
#include <stdio.h>
#include <stdlib.h>

#include "fgetdata.h"

int main(void)
{
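  /* Any character in this string ends a word. */
  const char *delimiters = " \t\r\n\f\v\a\b\\?\\\'\"!%^&*()=+/<>,.|[]{}#~";
  char *word = NULL;   /* NULL asks fgetword to provide the buffer */
  size_t size = 0;
  while(0 == fgetword(&word,
                      &size,
                      delimiters,
                      (size_t)-1,   /* no limit on the buffer size */
                      stdin,
                      0))
  {
    printf("Word found: [%s]\n", word);
  }
  free(word);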
  return 0;
}

As you can see, it is necessary to pass the address of the buffer pointer, because the fgetword function can (and sometimes does) need to change the location of the buffer. This is why it is essential that you don't use an auto or static buffer.

Note that the size information you give to fgetword is updated within the routine, so you can find out how much memory is tied up in the buffer. If you think it's too much, by all means reduce it yourself using realloc or, alternatively, pass the FGDATA_REDUCE flag as the last argument to the function. This will cause fgetword to reduce the buffer size to the minimum necessary to handle the current word. Note that you have absolute control over the buffer size, via the fourth parameter. If you don't want to limit the buffer size, set this to (size_t)-1. If you want the buffer size to be limited, this parameter is your chance to be strict. :-)

The fgetline function


Here's the prototype for fgetline:


int fgetline(char **line,
             size_t *size,
             size_t maxrecsize,
             FILE *fp,
             unsigned int flags);

The source code is in fgetline.c - you will also need fgetdata.h, which you should place in your include path.

The fgetline function reads a line at a time. It is effectively equivalent in most respects to fgetword(line, size, "\n", maxrecsize, fp, flags); - the only difference being that whereas fgetword treats an empty string as equivalent to end-of-file (think about it!), the fgetline function will (correctly) retain blank lines. This means it may be necessary to test line[0] against '\0' before processing a line, depending on the needs of your application.

Other pages and routines dealing with this subject.


Chuck Falconer's ggets function.

Morris Dovey's getsm function.

Strmsrfr's fgetanyline function.







That's it for now. Suggestions for improvements to this page are most welcome. I would like to offer my thanks to the authors of the bluefish HTML editor, which I used for composing this article.

