Score:1

Multiple portion selection of a string in Python


I have a log file as below:

12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:27 +0330 SOCK5.6699 00094 user156 32.99.193.2:51242 1.1.1.1:443 715 388 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40048 1.1.1.1:443 18105 29029 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40070 1.1.1.1:443 674 26805 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:24 +0330 SOCK5.6699 00000 user143 112.199.63.119:60682 1.1.1.1:443 475 445 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:37 +0330 SOCK5.6699 00000 user105 191.184.66.98:40102 1.1.1.1:443 12913 18780 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:42 +0330 SOCK5.6699 00000 user143 112.199.63.119:60688 1.1.1.1:443 4530 34717 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:44 +0330 SOCK5.6699 00000 user127 212.167.145.49:2972 1.1.1.1:443 827 267 0 CONNECT 1.1.1.1:443

My goal is to extract two portions of each line in this log file:

  1. Username
  2. IP address of the user source

Below is a sample line; the portions needed are the username (user144) and the source IP address (97.251.107.125):

12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443

So I wrote a Python script that extracts both items, stores them in separate lists, and then joins them with the zip function.

import pprint
import collections

iplist=[]
for l in data:
    ip_port=l[53:71]
    iplist.append(ip_port.split(':')[0])


userlist=[]
for u in data:
    user=u[42:52]
    userlist.append(user.replace(" ", ""))

a=list(zip(iplist,userlist))
most_ip=collections.Counter(a).most_common(5)
pprint.pprint(most_ip)

This code works fine, and I'm able to get the most-used IPs with their corresponding usernames. I should also mention that I didn't use the re module, since it was also listing the second IP (the destination IP, 1.1.1.1, which I don't care about).
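
For comparison, a regular expression does not have to pick up the destination address. If the pattern skips the leading fields and captures only the address that directly follows the username, the second IP is never matched. This is a minimal sketch, assuming the field order shown in the sample log, usernames without whitespace, and an IPv4 source address; the names `SOURCE_RE` and `parse_source` are illustrative and not part of the original script:

```python
import re

# Hypothetical pattern: skip five whitespace-separated fields, then capture
# the username and the IPv4 address that directly follows it (port dropped).
SOURCE_RE = re.compile(r"(?:\S+\s+){5}(\S+)\s+(\d{1,3}(?:\.\d{1,3}){3}):\d+")

def parse_source(line):
    """Return (user, source_ip) for one log line, or None if it does not match."""
    m = SOURCE_RE.match(line)
    return (m.group(1), m.group(2)) if m else None

sample = ("12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 "
          "97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443")
print(parse_source(sample))  # ('user144', '97.251.107.125')
```

Because `match` only looks at the start of the line and the pattern stops after the first `ip:port` pair, the trailing 1.1.1.1:443 fields are ignored.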

Question: Is there a neater way to write this code?

dirkt:
You could have used `cut` (commandline tool).
Zareh Kasparian:
@dirkt That is a Linux/Unix command. I'm trying to use Python, since I want to run the script on some non-Unix systems as well.
This is probably a better fit for StackOverflow since it's about programming. Not sure if it's an answer to your actual problem, but there are lots of tools for parsing logs out there, such as Elastic's Filebeat, among many others. You could also look at PyGrok.
Also, you're doing 2 iterations through the data which is slow. Do one, split each line on spaces, pull out the fields you need by index and add them to the dictionary. You'll do it in half the time.
Zareh Kasparian:
@shearn89 Thanks, you made a good point. I have edited my code; it looks simpler and much clearer now.
Score:1

Your new code can still be optimized further. The two things that stand out most to me:

Do not call split() more than once per log line. Call it once and store the result in a variable, because every call takes some time (not much, but it adds up the more data you process).

s = i.split(' ')
ip=s[6].split(':')[0]
user=s[5]

Why create two lists and then zip them together afterwards? Just store the tuples directly in a list:

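import collections

# `data` is the iterable of log lines from the question
l = []
for i in data:
    s = i.split(' ')            # split once and reuse the fields
    ip = s[6].split(':')[0]     # source IP without the port
    user = s[5]
    l.append((ip, user))        # a tuple literal is enough; tuple() around it is redundant
top_used = collections.Counter(l).most_common(5)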
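
Put together as a self-contained sketch, the single-pass approach could look like the following. The function name `top_pairs` and the variable `sample_log` are hypothetical, the sample lines are copied from the question, and calling `split()` without an argument is an assumption made here so that columns padded with several spaces are still handled:

```python
from collections import Counter

def top_pairs(lines, n=5):
    """Count (source_ip, user) pairs in one pass and return the n most common."""
    counts = Counter()
    for line in lines:
        fields = line.split()          # split once; no argument also collapses repeated spaces
        if len(fields) < 7:            # skip blank or truncated lines
            continue
        ip = fields[6].split(':')[0]   # drop the port from "ip:port"
        counts[(ip, fields[5])] += 1
    return counts.most_common(n)

sample_log = """\
12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40048 1.1.1.1:443 18105 29029 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40070 1.1.1.1:443 674 26805 0 CONNECT 1.1.1.1:443
""".splitlines()

print(top_pairs(sample_log, n=1))  # [(('191.184.66.98', 'user105'), 2)]
```

Updating the Counter inside the loop avoids building any intermediate list, which matters mostly for very large log files.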
Zareh Kasparian:
Thanks for your code. Is using a tuple here just a way to speed up the code?
Misc08:
@ZarehKasparian Indeed, creating the tuples directly speeds up the code, since you no longer need the zip function, which basically just creates tuples from those two lists; see https://docs.python.org/3/library/functions.html#zip
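
To illustrate the point about zip, here is a small sketch with made-up values (the `records` list is invented for illustration only). Pairing two parallel lists with zip yields the same tuples as building them directly, so the Counter result is identical and the intermediate lists are unnecessary:

```python
from collections import Counter

# Made-up (ip, user) records, for illustration only
records = [("10.0.0.1", "alice"), ("10.0.0.2", "bob"), ("10.0.0.1", "alice")]

# Variant 1: two parallel lists, paired afterwards with zip()
iplist = [ip for ip, _ in records]
userlist = [user for _, user in records]
zipped = Counter(zip(iplist, userlist))

# Variant 2: tuples built directly, no intermediate lists
direct = Counter((ip, user) for ip, user in records)

print(zipped == direct)        # True
print(direct.most_common(1))   # [(('10.0.0.1', 'alice'), 2)]
```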
Score:1

Following shearn89's suggestion, I have edited my code as below. It is much simpler, with a single iteration.

userlist=[]
iplist=[]
for i in data:
    ip=i.split(' ')[6].split(':')[0]
    user=i.split(' ')[5]
    iplist.append(ip)
    userlist.append(user)

top_used=collections.Counter(zip(iplist,userlist)).most_common(5)
pprint.pprint(top_used)