From RFC 1808, Section 2.1, every URL should follow a specific format:
<scheme>://<netloc>/<path>;<params>?<query>#<fragment>
Let's break this format down syntactically:
scheme: The protocol name (which you'll usually see as http/https).
netloc: Contains the network location, which includes the domain itself (and subdomain if present), the port number, and optional credentials in the form of username:password. Together it may take the form of username:password@example.com:80.
path: Contains information on how the specified resource needs to be accessed.
params: Element which adds fine tuning to the path. (optional)
query: Another element adding fine-grained access to the path in consideration. (optional)
fragment: Contains bits of information about the resource being accessed within the path. (optional)
Let's take a very simple example to understand the above clearly:
https://cat.example/list;meow?breed=siberian#pawsize
In the above example:
https is the scheme (first element of a URL)
cat.example is the netloc (sits between the scheme and path)
/list is the path (between the netloc and params)
meow is the param (sits between path and query)
breed=siberian is the query (between the params and fragment)
pawsize is the fragment (last element of a URL)
This can be replicated programmatically using Python's urllib.parse.urlparse:
>>> import urllib.parse
>>> url = 'https://cat.example/list;meow?breed=siberian#pawsize'
>>> urllib.parse.urlparse(url)
ParseResult(scheme='https', netloc='cat.example', path='/list', params='meow', query='breed=siberian', fragment='pawsize')
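The netloc described above can itself be taken apart. As a small sketch, reusing the username:password@example.com:80 form from the breakdown (the /list path is only an illustrative assumption), the parse result exposes the pieces as attributes:

```python
from urllib.parse import urlparse

# Illustrative URL built from the netloc form shown in the breakdown above.
parts = urlparse("https://username:password@example.com:80/list")

print(parts.netloc)    # username:password@example.com:80
print(parts.username)  # username
print(parts.password)  # password
print(parts.hostname)  # example.com
print(parts.port)      # 80 (returned as an int)
```

Note that netloc is the raw string, while hostname and port are derived from it.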
Now, coming to your code: the if statement checks whether next_page exists and whether it has a netloc. In that login() function, the test .netloc != '' is true when url_parse(next_page) yields an absolute URL, that is, one that names a host. This is how the code tells absolute URLs apart from relative ones: a relative URL has a path but no hostname (and thus no netloc).
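To see the distinction concretely, here is a minimal sketch. It uses the standard library's urlparse as a stand-in for the url_parse call in your code, and has_netloc plus the sample URLs are hypothetical names chosen only for illustration:

```python
from urllib.parse import urlparse

def has_netloc(target):
    """Return True when the URL names a host (hypothetical helper)."""
    return urlparse(target).netloc != ''

# Relative URLs: a path, but no host.
print(has_netloc('/index'))          # False
print(has_netloc('profile?tab=1'))   # False

# URLs that name a host, with or without a scheme.
print(has_netloc('https://other.example/page'))  # True
print(has_netloc('//other.example/page'))        # True
```

The last case shows that a URL starting with // still carries a netloc even though it has no scheme.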
import urllib.parse

url = "https://example.com/something?a=1&b=1"
o = urllib.parse.urlsplit(url)
print(o.netloc)  # example.com
from urllib.parse import urlparse, urlunparse

url = "www.google.com"

# Add scheme to the URL
url = urlparse(url)._replace(scheme='http').geturl()
print(url)

# OR
anotherURL = urlunparse(urlparse(url)._replace(scheme='http'))
print(anotherURL)
>>> 'http:///www.google.com'
In an attempt to add a scheme to the URL, I obtained a ParseResult object for the URL and tried to set its scheme field to 'http'. However, when I unparse the URL using the geturl() method or the urlunparse() function, expecting http://www.google.com as output, the result turns out to be http:///www.google.com. What could be the explanation for this?
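One way to investigate is to look at what the parser does with the bare string. The sketch below only inspects the parsed fields. The second half prefixes the string with //, which is one possible workaround and an assumption on my part, not the only approach:

```python
from urllib.parse import urlparse

# Without a scheme or a leading '//', the whole string is treated as a path.
bare = urlparse("www.google.com")
print(repr(bare.netloc))  # ''
print(repr(bare.path))    # 'www.google.com'

# With a leading '//', the host lands in netloc instead of path.
with_slashes = urlparse("//www.google.com")
print(repr(with_slashes.netloc))  # 'www.google.com'

# Setting the scheme now produces the expected URL.
fixed = with_slashes._replace(scheme="http").geturl()
print(fixed)  # http://www.google.com
```

Since the netloc is empty in the first case, the host name is reassembled as a path component rather than as a network location, which would account for the extra slash.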