
How to replace URLs from html to a file using grep, sed or any common tools?


I'm trying to update the URLs in my conf file from an HTML page, because the URLs sometimes get updated/changed. I need a simple script that can fetch the HTML and update/replace the URLs in my conf file. I'm new to Ubuntu/Linux.

Edit: The HTML page is changed by the server admin, and I don't have any access to it. I can only visit the up-to-date site at the HTML location below and update my conf file manually.

HTML location: https://10.10.10.1

Part of HTML file:

<li><a href="#" class="dropdown-toggle hvr-bounce-to-bottom" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Movies<span class="caret"></span></a>
                                    <ul class="dropdown-menu">
                                        
<li><a class="hvr-bounce-to-bottom" href="http://10.10.10.7/MY-FTP-2/English%20Movies/">English Movies</a></li>
<li><a class="hvr-bounce-to-bottom" href="http://10.10.10.8/MY-FTP-1/English%20Movies%20%281080p%29/">English Movies -1080p </a></li>
                                        
                                        <li><a class="hvr-bounce-to-bottom" href="http://10.10.10.9/MY-FTP-1/Hindi%20Movies/">Hindi Movies</a></li>
                                        <li><a class="hvr-bounce-to-bottom" href="http://10.10.10.8/MY-FTP-1/SOUTH%20INDIAN%20MOVIES/Hindi%20Dubbed/">South-Movie Hindi Dubbed</a></li>
                                        <li><a class="hvr-bounce-to-bottom" href="http://10.10.10.10/MY-FTP-3/Animation%20Movies/">Animation Movies</a></li>
                                        <li><a class="hvr-bounce-to-bottom" href="http://10.10.10.10/MY-FTP-3/Animation%20Movies%20%281080p%29/">Animation Movies -1080p</a></li>
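
For reference, the link text and href pairs in an excerpt like this can be pulled out with a small parser. The following is only a sketch: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without third-party packages, the class name `LinkCollector` is made up for illustration, and the `sample` string is a shortened copy of the excerpt rather than the real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} for every <a> tag that has an http(s) href."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and trim the ends of the link text.
            name = " ".join("".join(self._text).split())
            # Skip placeholder links such as the href="#" dropdown toggle.
            if name and self._href.startswith("http"):
                self.links[name] = self._href
            self._href = None


# Shortened copy of the excerpt (hypothetical test input, not the live page).
sample = """
<li><a href="#" class="dropdown-toggle">Movies<span class="caret"></span></a>
<li><a class="hvr-bounce-to-bottom" href="http://10.10.10.7/MY-FTP-2/English%20Movies/">English Movies</a></li>
<li><a class="hvr-bounce-to-bottom" href="http://10.10.10.8/MY-FTP-1/English%20Movies%20%281080p%29/">English Movies -1080p </a></li>
"""

collector = LinkCollector()
collector.feed(sample)
for name, href in collector.links.items():
    print(f"{name!r} -> {href}")
```

Normalising the link text matters here because "English Movies -1080p " carries a trailing space in the HTML but not in the rclone.conf section name.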

After fetching the HTML, the script should replace/update the links in my rclone.conf file.

rclone.conf file preview:

[Hindi Movies]
type = http
url = http://10.10.10.9/MY-FTP-1/Hindi%20Movies/

[English Movies]
type = http
url = http://10.10.10.7/MY-FTP-2/English%20Movies/

[English Movies -1080p]
type = http
url = http://10.10.10.9/MY-FTP-1/English%20Movies%20%281080p%29/

[South-Movie Hindi Dubbed]
type = http
url = http://10.10.10.9/MY-FTP-1/SOUTH%20INDIAN%20MOVIES/Hindi%20Dubbed/

[Animation Movies]
type = http
url = http://10.10.10.10/MY-FTP-3/Animation%20Movies/

[Animation Movies -1080p]
type = http
url = http://10.10.10.10/MY-FTP-3/Animation%20Movies%20%281080p%29/

So I've written a noob script to get the work started, but it gives me an error:

import re
import requests
from bs4 import BeautifulSoup

# Fetch the HTML from the website
html = requests.get("http://10.10.10.1/")

# Parse the HTML
soup = BeautifulSoup(html.text, 'html.parser')

# The location of the rclone.conf file
rclone_conf_file = '/home/user/tmp/rclone.conf'

# Open the rclone.conf file
with open(rclone_conf_file, 'r') as f:
    # Read the file into a list of lines
    lines = f.readlines()

# Iterate over the <a> tags in the HTML
for a in soup.find_all('a'):
    # Get the text of the <a> tag (e.g. "Hindi Movies")
    section_name = a.text.strip().lower()

    # Check if the section name exists in the rclone.conf file
    if any(section_name in line.lower() for line in lines):
        # Get the URL of the <a> tag
        new_url = a['href']

        # Use a regular expression to match the URL in the rclone.conf file
        regex = r'^(\[%s\]\n.*\n.*http.*)' % re.escape(section_name)

        # Update the URL in the rclone.conf file
        for i, line in enumerate(lines):
            if section_name in line.lower():
                print(lines[i])  # <-- Add this line
                lines[i] = re.sub(regex, r'\1', line, flags=re.IGNORECASE)
                lines[i] = lines[i].replace(lines[i].split()[2], new_url)

# Open the rclone.conf file for writing
with open(rclone_conf_file, 'w') as f:
    # Write the updated lines to the file
    for line in lines:
        f.write(line)

The error it's showing:

File "/home/plex/tmp/script.py", line 37, in <module>
    lines[i] = lines[i].replace(lines[i].split()[2], new_url)
IndexError: list index out of range
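
A note on the traceback: it can be reproduced offline. In the HTML excerpt, the dropdown toggle `<a>` has the text "Movies", so `section_name` becomes "movies", which is a substring of every section header and every url line in the conf preview. A header such as `[Hindi Movies]` splits into only two whitespace-separated tokens, so index 2 does not exist. The sketch below uses two lines copied from the rclone.conf preview; that the toggle is the first matching `<a>` on the real page is an assumption the excerpt does not confirm.

```python
# Two lines copied from the rclone.conf preview.
header_line = "[Hindi Movies]\n"
url_line = "url = http://10.10.10.9/MY-FTP-1/Hindi%20Movies/\n"

# Text of the dropdown toggle <a> in the excerpt, lowercased as in the script.
section_name = "movies"

header_tokens = header_line.split()
url_tokens = url_line.split()

# Both lines pass the substring check used in the script.
print(section_name in header_line.lower(), header_tokens)
print(section_name in url_line.lower(), url_tokens)

# Only the url line has a third token; the header line raises IndexError.
try:
    header_tokens[2]
except IndexError as exc:
    print("header line:", exc)
```

The same substring check also lets "english movies" match the `[English Movies -1080p]` header, so an exact comparison against the section name is likely the safer test.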


Bodo: How does your HTML file get changed or generated? Maybe it would be easier to modify a source file in a format that can be parsed easily and generate both the HTML file and the configuration file. How do the section names in `rclone.conf` correspond to the HTML data? Please [edit] your question to add requested information or clarification; don't use comments for this purpose.
OrigamiOfficial: I've edited it as you requested. I had mistyped the section names in rclone.conf.
terdon: Thanks for the edit. So, what do you have so far? Which part of this is giving you trouble? This isn't a free script writing service, but we are happy to _help_ with your script. Please show us what you have written so far and explain what you still need help with.
Bodo: In addition to fixing obvious errors, you should explain in your text how the values in your `rclone.conf` correspond to the HTML input. Are there any other links in the HTML document? What specific text pattern can be used to recognize the links that should be written to `rclone.conf`? Note that your question is not specifically related to Ubuntu but about general text processing or HTML parsing. It might be a better fit for https://stackoverflow.com.
OrigamiOfficial: Sorry for being late. I've added a dirty script & a regex but don't really know how to make them work together.
Rinzwind: I would advise using Python for this: do a split on "<li>" and you almost have an array with the 2 items per category you need. You can also automate fetching the HTML.
OrigamiOfficial: @Rinzwind I have no knowledge of Python!
Rinzwind: Basic Python takes about 2 hours to learn ;-)
OrigamiOfficial: @Rinzwind I've added a Python script to my post. I made it with a friend's help, but it shows the error mentioned in the post. Can you help me solve the issue?
Rinzwind: At line 35/36, do a print(lines[i]) and it will not show 2 spaces, so it errors out on the [2] ;)
OrigamiOfficial: @Rinzwind I've made the changes you recommended, but now it shows a new error!
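
One possible way around both errors is to walk the conf file line by line, remember which section is open, and rewrite only that section's `url` line. This is a minimal sketch under a few assumptions: the function name `update_urls` is made up, `links` is a dict of section name to new URL that has already been extracted from the HTML, and section names are compared exactly instead of by substring. The sample data is taken from the question, where the HTML lists "English Movies -1080p" on 10.10.10.8 while the conf still points at 10.10.10.9.

```python
def update_urls(conf_lines, links):
    """Return a copy of conf_lines with each known section's url replaced.

    links maps a section name (without brackets) to its new URL; sections
    that are not in links are left untouched.
    """
    updated = []
    current = None  # name of the section we are currently inside
    for line in conf_lines:
        stripped = line.strip()
        key, sep, _ = stripped.partition("=")
        if stripped.startswith("[") and stripped.endswith("]"):
            current = stripped[1:-1]
        elif sep and key.strip() == "url" and current in links:
            line = f"url = {links[current]}\n"
        updated.append(line)
    return updated


conf = [
    "[English Movies -1080p]\n",
    "type = http\n",
    "url = http://10.10.10.9/MY-FTP-1/English%20Movies%20%281080p%29/\n",
    "\n",
    "[Hindi Movies]\n",
    "type = http\n",
    "url = http://10.10.10.9/MY-FTP-1/Hindi%20Movies/\n",
]
links = {
    "English Movies -1080p": "http://10.10.10.8/MY-FTP-1/English%20Movies%20%281080p%29/",
}

result = update_urls(conf, links)
print("".join(result), end="")
```

Because the section name is compared exactly, "English Movies" no longer matches the `[English Movies -1080p]` header, and header lines are never split on whitespace, so the `IndexError` cannot occur.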