Ubuntu: Python script: How to chop the output to a limited line size?



Question:

I am using a Python script to separate the domain from each email address and then group the emails by their respective domain. The following script works for me:

#!/usr/bin/env python3
from operator import itemgetter
from itertools import groupby
import os
import sys

dr = sys.argv[1]

for f in os.listdir(dr):
    write = []
    file = os.path.join(dr, f)
    lines = [[l.strip(), l.split("@")[-1].strip()] for l in open(file).readlines()]
    lines.sort(key=itemgetter(1))
    for item, occurrence in groupby(lines, itemgetter(1)):
        func = [s[0] for s in list(occurrence)]
        write.append(item+","+",".join(func))
    open(os.path.join(dr, "grouped_"+f), "wt").write("\n".join(write))

I used: python3 script.py /path/to/input files
The input I gave was a list of emails, and I got the output as:

domain1.com,email1@domain1.com,email2@domain1.com
domain2.com,email1@domain2.com,email2@domain2.com,email3@domain2.com

But the problem I am facing is due to a MongoDB limit: MongoDB has a document size limit of 16 MB, and a single line in my output file is treated as one document by MongoDB, so the line size must not go beyond 16 MB.
So what I want is for the result to be limited to 21 emails per domain per line; if a domain has more emails, the rest should be printed on a new line with the same domain name (and again on a new line if those also exceed 21). I can store duplicate data in MongoDB.

So the final output should be something like the following:

domain1.com,email1@domain1.com,email2@domain1.com,... email21@domain1.com
domain1.com,email22@domain1.com,.....
domain2.com,email1@domain2.com,....

The dots (.) in the above example represent more text, which I chopped to keep the example simple to understand.
I hope this clarifies my problem, and I am hoping to get a solution for it.


Solution:1

New version

The script you posted indeed groups the emails by domain, with no limit on the number. Below is a version that groups emails by domain but splits the resulting list into chunks of arbitrary size. Each chunk is printed on its own line, starting with the corresponding domain.

The script

#!/usr/bin/env python3
from operator import itemgetter
from itertools import groupby, islice
import os
import sys

dr = sys.argv[1]
size = 3

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

for f in os.listdir(dr):
    # list the files
    with open(os.path.join(dr, "chunked_"+f), "wt") as report:
        file = os.path.join(dr, f)
        # create a list of email addresses and domains, sort by domain
        lines = [[l.strip(), l.split("@")[-1].strip()] for l in open(file).readlines()]
        lines.sort(key=itemgetter(1))
        # group by domain, split into chunks
        for domain, occurrence in groupby(lines, itemgetter(1)):
            adr = list(chunk([s[0] for s in occurrence], size))
            # write lines to output file
            for a in adr:
                report.write(domain+","+",".join(a)+"\n")
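
The chunk() helper uses the two-argument form of iter(): the lambda is called repeatedly, producing size-sized tuples, until it returns the empty-tuple sentinel (). As a minimal sketch of how it behaves on its own (the sample addresses below are made up for illustration):

from itertools import islice

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

emails = ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"]
print(list(chunk(emails, 2)))
# [('a@x.com', 'b@x.com'), ('c@x.com', 'd@x.com'), ('e@x.com',)]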

To use

  • Copy the script into an empty file, save it as chunked_list.py
  • In the head section, set the chunk size:

    size = 5  
  • Run the script with the directory as argument:

    python3 /path/to/chunked_list.py /path/to/files  

It will then create an edited copy of each of the files, named chunked_filename, with the (chunked) grouped emails.

What it does

The script takes as input a directory with files like:

email1@domain1
email2@domain1
email3@domain2
email4@domain1
email5@domain1
email6@domain2
email7@domain1
email8@domain2
email9@domain1
email10@domain2
email11@domain1

For each file, it creates a copy like:

domain1,email1@domain1,email2@domain1,email4@domain1
domain1,email5@domain1,email7@domain1,email9@domain1
domain1,email11@domain1
domain2,email3@domain2,email6@domain2,email8@domain2
domain2,email10@domain2

(with size = 3, as in the script above)


Solution:2

To support arbitrarily large directories and files, you could use os.scandir() to receive files one by one and process each file line by line:

#!/usr/bin/env python3
import os

def emails_with_domain(dirpath):
    for entry in os.scandir(dirpath):
        if not entry.is_file():
            continue  # skip non-files
        with open(entry.path) as file:
            for line in file:
                email = line.strip()
                if email:  # skip blank lines
                    yield email.rpartition('@')[-1], email  # domain, email
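
For example, assuming the input directory contains files like the ones shown in the first answer, iterating over this generator yields one (domain, email) pair at a time without reading whole files into memory (the path below is just a placeholder):

# minimal usage sketch; '/path/to/files' is a placeholder directory
for domain, email in emails_with_domain('/path/to/files'):
    print(domain, email)
# e.g. domain1 email1@domain1
#      domain2 email3@domain2
#      ...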

To group email addresses by domain, no more than 21 emails per line, you could use collections.defaultdict():

import sys
from collections import defaultdict

dirpath = sys.argv[1]
with open('grouped_emails.txt', 'w') as output_file:
    emails = defaultdict(list)  # domain -> emails
    for domain, email in emails_with_domain(dirpath):
        domain_emails = emails[domain]
        domain_emails.append(email)
        if len(domain_emails) == 21:
            print(domain, *domain_emails, sep=',', file=output_file)
            del domain_emails[:]  # clear

    # flush any remaining (fewer than 21) emails per domain
    for domain, domain_emails in emails.items():
        if domain_emails:  # skip domains whose list was already flushed and is now empty
            print(domain, *domain_emails, sep=',', file=output_file)
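
This snippet relies on the emails_with_domain() generator defined above, so both belong in the same script. Assuming you save them together as, say, group_emails.py (the file name here is just an example), you would run it the same way as the first solution:

    python3 group_emails.py /path/to/files

The grouped result is then written to grouped_emails.txt in the current working directory.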

Note:

  • all emails are saved to the same file
  • lines with the same domain are not necessarily adjacent

See What is the most "pythonic" way to iterate over a list in chunks?
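
That question collects several chunking idioms; one commonly shown variant slices a list in fixed steps rather than working on an arbitrary iterator. A minimal sketch (the function name chunks is just illustrative):

def chunks(lst, n):
    # yield successive n-sized slices of lst
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

print(list(chunks(list(range(7)), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]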

