Ubuntu: How do I generate a running cumulative total of the numbers in a text file?


I have a text file with 2 million lines, each containing a positive integer. I am trying to build a kind of frequency table.

Input file:

3
4
5
8

Output should be:

3
7
12
20

How do I go about doing this?


With awk:

awk '{total += $0; $0 = total}1'  

$0 is the current line. So, for each line, I add it to the total, set the line to the new total, and then the trailing 1 is an awk shortcut - it prints the current line for every true condition, and 1 as a condition evaluates to true.
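For readers who don't know awk, the same per-line logic can be sketched in Python. This is only an illustration; the function name running_totals is made up here, and it assumes every line holds a valid integer (awk would silently treat anything else as 0, while int() raises).

```python
def running_totals(lines):
    """Yield the cumulative total after each input line, like the awk one-liner."""
    total = 0
    for line in lines:
        total += int(line)  # awk: total += $0
        yield total         # awk: $0 = total, then the trailing 1 prints it


if __name__ == "__main__":
    sample = ["3", "4", "5", "8"]  # the sample input from the question
    for value in running_totals(sample):
        print(value)
```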


In a python script:

#!/usr/bin/env python3
import sys

f = sys.argv[1]; out = sys.argv[2]

n = 0

with open(out, "wt") as wr:
    with open(f) as read:
        for l in read:
            n = n + int(l); wr.write(str(n)+"\n")

To use

  • Copy the script into an empty file, save it as add_last.py
  • Run it with the source file and targeted output file as arguments:

    python3 /path/to/add_last.py <input_file> <output_file>  


The code is rather readable, but in detail:

  • Open output file for writing results

    with open(out, "wt") as wr:  
  • Open input file for reading per line

    with open(f) as read:
        for l in read:
  • Read the lines, adding the value of the new line to the total:

    n = n + int(l)  
  • Write the result to the output file:

    wr.write(str(n)+"\n")

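The steps above can also be condensed with itertools.accumulate from the standard library. This is a sketch rather than a replacement for the script; the function names cumulative_lines and write_running_totals are invented for the example, and it assumes one valid integer per line.

```python
from itertools import accumulate


def cumulative_lines(lines):
    """Return an iterator over the running totals of integer strings."""
    return accumulate(int(line) for line in lines)


def write_running_totals(src_path, dst_path):
    """Read one integer per line from src_path and write running totals to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for total in cumulative_lines(src):
            dst.write(f"{total}\n")
```

Because accumulate is lazy, the file is still processed line by line and never loaded into memory as a whole.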

Just for fun

$ sed 'a+p' file | dc -e0 -
3
7
12
20

This works by appending a line containing +p after each line of the input, and then passing the result to the dc calculator, where

   +      Pops two values off the stack, adds them, and pushes the result.
          The precision of the result is determined only by the values of
          the arguments, and is enough to be exact.

   p      Prints the value on the top of the stack, without altering the
          stack.  A newline is printed after the value.

The -e0 argument pushes 0 onto the dc stack to initialize the sum.
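To make the stack behaviour concrete, here is a rough Python simulation of what dc does with that input. It is not how dc is implemented; dc_running_sum is a hypothetical name, and only the number, + and p operations used above are modelled.

```python
def dc_running_sum(lines):
    """Simulate `dc -e0` reading each number followed by a `+p` line."""
    stack = [0]                      # -e0 pushes the initial sum
    printed = []
    for line in lines:
        stack.append(int(line))      # a bare number is pushed onto the stack
        right = stack.pop()          # '+' pops two values ...
        left = stack.pop()
        stack.append(left + right)   # ... and pushes their sum
        printed.append(stack[-1])    # 'p' prints the top without popping it
    return printed
```

After every +p the stack holds exactly one value, the running sum, which is why the initial 0 is needed for the first addition.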


In Bash:

#! /bin/bash

file="YOUR_FILE.txt"

TOTAL=0
while IFS= read -r line
do
    TOTAL=$(( TOTAL + line ))
    echo $TOTAL
done <"$file"


To print partial sums of integers given on the standard input one per line:

#!/usr/bin/env python3
import sys

partial_sum = 0
for n in map(int, sys.stdin):
    partial_sum += n
    print(partial_sum)


If for some reason the command is too slow, you could use this C program:

#include <inttypes.h>
#include <ctype.h>
#include <stdio.h>

int main(void)
{
  uintmax_t cumsum = 0, n = 0;
  for (int c = EOF; (c = getchar()) != EOF; ) {
    if (isdigit(c))
      n = n * 10 + (c - '0');
    else if (n) { // complete number
      cumsum += n;
      printf("%" PRIuMAX "\n", cumsum);
      n = 0;
    }
  }
  if (n)
    printf("%" PRIuMAX "\n", cumsum + n);
  return feof(stdin) ? 0 : 1;
}

To build it and run, type:

$ cc cumsum.c -o cumsum
$ ./cumsum < input > output


UINTMAX_MAX, the largest sum the program can represent, is 18446744073709551615 on a typical 64-bit system.

The C code is several times faster than the awk command on my machine for the input file generated by:

#!/usr/bin/env python3
import numpy.random
print(*numpy.random.random_integers(100, size=2000000), sep='\n')
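
If NumPy is not installed, or its random_integers function (deprecated in later NumPy releases in favour of randint) is unavailable, a similar test file can probably be generated with the standard library alone. This is a sketch; make_sample and its parameters are names chosen for the example, and the 1 to 100 range is assumed to match the generator above.

```python
import random


def make_sample(count, low=1, high=100, seed=None):
    """Return `count` random integers, each in the inclusive range [low, high]."""
    rng = random.Random(seed)  # a private generator; pass a seed for repeatable data
    return [rng.randint(low, high) for _ in range(count)]


if __name__ == "__main__":
    # Print 2 million numbers, one per line, for redirecting into an input file
    print(*make_sample(2_000_000), sep="\n")
```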


You probably want something like this:

sort -n <filename> | uniq -c | awk 'BEGIN{print "Number\tFrequency"}{print $2"\t"$1}'  

Explanation of the command:

  • sort -n <filename> | uniq -c sorts the input and returns a frequency table
  • | awk 'BEGIN{print "Number\tFrequency"}{print $2"\t"$1}' turns the output into a nicer format

Input File list.txt:

4
5
3
4
4
2
3
4
5

The command:

$ sort -n list.txt | uniq -c | awk 'BEGIN{print "Number\tFrequency"}{print $2"\t"$1}'
Number  Frequency
2       1
3       2
4       4
5       2
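
For comparison, the same kind of frequency table can be sketched in Python with collections.Counter. The function name frequency_table is made up for this example, and it assumes one integer per non-empty line.

```python
from collections import Counter


def frequency_table(lines):
    """Return (number, count) pairs in numeric order, like sort -n | uniq -c."""
    counts = Counter(int(line) for line in lines if line.strip())
    return sorted(counts.items())


if __name__ == "__main__":
    sample = "4 5 3 4 4 2 3 4 5".split()  # the contents of list.txt
    print("Number\tFrequency")
    for number, count in frequency_table(sample):
        print(f"{number}\t{count}")
```

Unlike the shell pipeline, this counts in a single pass and only sorts the distinct values at the end.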


You can do this in vim. Open the file and type the following keystrokes:

qaqqayiwj@"<C-a>@aq@a:wq<cr>

Note that <C-a> means Ctrl-A, and <cr> is a carriage return, i.e. the Enter key.

Here's how this works. First off, we want to clear out register 'a' so that it has no side-effects on the first time through. This is simply qaq. Then we do the following:

qa                  " Start recording keystrokes into register 'a'
  yiw               " Yank this current number
     j              " Move down one line. This will break the loop on the last line
      @"            " Run the number we yanked as if it was typed, and then
        <C-a>       " increment the number under the cursor *n* times
             @a     " Call macro 'a'. While recording this will do nothing
               q    " Stop recording
                @a  " Call macro 'a', which will call itself creating a loop

After this recursive macro is done running, we simply call :wq<cr> to save and quit.
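
The effect of the macro on the buffer can be sketched in Python as an in-place update, where each line receives the already-updated value of the line above it. This only models the result, not vim itself; apply_macro is a hypothetical name.

```python
def apply_macro(numbers):
    """Add each line's (already updated) value to the line below it, in place."""
    for i in range(len(numbers) - 1):  # j fails on the last line, ending the loop
        yanked = numbers[i]            # yiw: yank the number under the cursor
        numbers[i + 1] += yanked       # j, then <C-a> repeated `yanked` times
    return numbers
```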


Perl one-liner:

$ perl -lne 'print $sum+=$_' input.txt
3
7
12
20

With 2.5 million lines of numbers, it takes about 6.6 seconds to process:

$ time perl -lne 'print $sum+=$_' large_input.txt > output.txt

    0m06.64s real     0m05.42s user     0m00.09s system

$ wc -l large_input.txt
2500000 large_input.txt


A simple Bash one-liner:

x=0 ; while read n ; do x=$((x+n)) ; echo $x ; done < INPUT_FILE  

x is the cumulative sum of all numbers from the current line and above.
n is the number in the current line.

We loop over all the lines n of INPUT_FILE, add each line's numeric value to our variable x, and print that sum on every iteration.

Bash is a bit slow here, though. You can expect this to run for around 20-30 seconds on a file with 2 million entries, and that is without printing the output to the console (which is even slower, independent of the method you use).


Similar to @steeldriver's answer, but with the slightly less arcane bc instead:

sed 's/.*/a+=&;a/' input | bc  

The nice thing about bc (and dc) is that they are arbitrary precision calculators, so will never overflow or suffer lack of precision over integers.
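
To illustrate the difference, here is a small Python sketch contrasting exact arithmetic (Python integers, like bc and dc, grow as needed) with fixed-width arithmetic. It assumes a 64-bit unsigned type; wrapped_add is a name invented for the example.

```python
UINTMAX_MAX = 2**64 - 1  # assumption: the fixed-width type is 64 bits wide


def wrapped_add(a, b):
    """Add the way an unsigned 64-bit C integer would, wrapping on overflow."""
    return (a + b) & UINTMAX_MAX


exact = UINTMAX_MAX + 1                 # arbitrary precision: the sum keeps growing
wrapped = wrapped_add(UINTMAX_MAX, 1)   # fixed width: the sum wraps around to 0
```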

The sed expression transforms the input to:

a+=3;a
a+=4;a
a+=5;a
a+=8;a

This is then evaluated by bc. The bc variable a is auto-initialised to 0. Each line increments a, then explicitly prints it.
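
The two stages can be mimicked in Python to see what each one contributes. This is a sketch only: to_bc_program imitates the sed substitution, and evaluate is a tiny stand-in that understands nothing but the generated a+=N;a lines, not real bc syntax. Both names are hypothetical.

```python
def to_bc_program(lines):
    """Mimic sed 's/.*/a+=&;a/': wrap each line in an add-then-print statement."""
    return [f"a+={line.rstrip()};a" for line in lines]


def evaluate(program):
    """Stand-in for bc that handles only the generated `a+=N;a` statements."""
    a = 0  # bc starts unset variables at 0
    printed = []
    for statement in program:
        increment, _ = statement.split(";")  # "a+=3" and the bare "a"
        a += int(increment[len("a+="):])     # a+=N adds N to a
        printed.append(a)                    # the bare a prints its value
    return printed
```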
