Python: Need to scrape a website that requires login

I cannot get past the login screen. Full source for the login page: https://pastebin.com/v5u4bPNj
Looking for a complete example that:
- performs a login with a correct UID/PW,
- lists all the links for a specific URL,
- handles invalid login and invalid URL requests.

This is the template I have been trying to get to work: https://pastebin.com/PGm3uECz
I am open to any working approach, though.

1) The Alfresco site is an internal app (behind our firewall), so I cannot provide a user ID.

2) No Node.js; it needs to be Python 3 (I should have specified that).

3) The CSRF token was just a placeholder. I could not find any tokens in the page source, but I could have missed them.

4) On the links, a simple list of the HTML links in the tag.

5) The form header information recorded during login is here:
https://pastebin.com/31nfKU7T

6) The login JavaScript recorded in the console is here: https://pastebin.com/kdZNpk2s

Thanks!

Would a solution using Node.js be acceptable? Can you be more specific on the 'list all the links for a specific URL' part?
kostasx 14 days ago
Hello Broadreach, can I get a test user name and password so that I can test my Python script?
Codeword 14 days ago
Hey, I noticed you have commented out your CSRF token. If the website you are trying to log in to requires a CSRF token, you will need to pass it along with the other data (username and password), or else it will return the login page. payload = { "username": USERNAME, "password": PASSWORD, # "csrfmiddlewaretoken": authenticity_token }
Codeword 14 days ago
To verify whether there is a CSRF token, do this: go to the login page, open the developer console, switch to the Network tab, and sign in manually. You will see a request related to the login (pick the one that looks most appropriate). Click it and look at the Headers tab, in the Form Data section. Check whether any CSRF token is sent there apart from the username and password. If there is a CSRF token, my method above will solve this issue. If not, we can take other steps. Thank you
Codeword 14 days ago
I am quite sure the CSRF token is enabled. See this in your HTML: Alfresco.constants.CSRF_POLICY = { enabled: true, cookie: "{token}", header: "{token}", parameter: "{token}", properties: {} }
Codeword 14 days ago
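A quick way to check that flag from Python is sketched below. It is only a rough sketch that assumes the policy appears in the page as the object literal quoted above; the function name `csrf_policy_enabled` and the sample string are made up for illustration. It reads what the server sent and nothing more.

```python
import re

# Matches the object literal assigned to Alfresco.constants.CSRF_POLICY.
# Assumption: the `enabled` flag appears before the first closing brace,
# as it does in the snippet quoted in this thread.
_POLICY_RE = re.compile(
    r"Alfresco\.constants\.CSRF_POLICY\s*=\s*\{(.*?)\}", re.DOTALL
)
_ENABLED_RE = re.compile(r"\benabled\s*:\s*(true|false)\b")


def csrf_policy_enabled(page_html):
    """Return True or False for the policy's `enabled` flag, or None if absent."""
    policy = _POLICY_RE.search(page_html)
    if policy is None:
        return None
    flag = _ENABLED_RE.search(policy.group(1))
    if flag is None:
        return None
    return flag.group(1) == "true"


# Sample input copied from the policy quoted in the comments.
sample = """
Alfresco.constants.CSRF_POLICY = {
   enabled: true,
   cookie: "{token}",
   header: "{token}",
   parameter: "{token}",
   properties: {}
};
"""
print(csrf_policy_enabled(sample))  # prints: True
```

With a live session, the same function could be called on the text of the login page response.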
Or provide me with the JavaScript files so that I can test them on my system. As I said, this line, Alfresco.constants.CSRF_POLICY = { enabled: true, cookie: "{token}", header: "{token}", parameter: "{token}", properties: {} }, says that CSRF is enabled, and you are saying it's not. Try setting enabled: false and then try your Python code. Thank you.
Codeword 14 days ago
I do not see any elements in the network tab that look like yours. The closest I get is this JavaScript: https://pastebin.com/r3NZVZaG
broadreach 14 days ago
Just set LOGIN_URL = "http://share.baner.com/share/page/dologin" and enabled: false in Alfresco.constants.CSRF_POLICY, as I suggested above. Let me know how it goes. Thank you
Codeword 14 days ago
Have you tried setting the parameters as above? If that solution is not working, there is another possibility: apart from username and password, the form also submits two hidden input values named success and failure, which we are not passing in our Python code.
Codeword 14 days ago
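The hidden-field idea above could be tried with something like the following. This is a sketch using only the standard library; the class and function names are invented here, and the form markup and the `success`/`failure` values in the sample are placeholders, not taken from the real login page.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_page_html, username, password):
    """Merge the form's hidden fields with the credentials."""
    collector = HiddenInputCollector()
    collector.feed(login_page_html)
    payload = dict(collector.fields)
    payload["username"] = username
    payload["password"] = password
    return payload


# Hypothetical form: the field names come from the comment above,
# the values are placeholders for illustration only.
sample_form = """
<form action="/share/page/dologin" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="success" value="/share/page/">
  <input type="hidden" name="failure" value="/share/page/?error=true">
</form>
"""
print(build_login_payload(sample_form, "someuser", "secret"))
```

The resulting dictionary could then be passed as the `data` argument of the login POST, so whatever hidden fields the real form carries are sent along with the credentials.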
Sorry, I don't have the files here locally to test with, so this is the only way for me to help you. Moreover, I don't think there is any major issue with your Python code apart from the login URL. I think we are not passing the parameters correctly in our POST request. Thank you
Codeword 14 days ago
awarded to Codeword


3 Solutions


Try import.io ...

Thanks, but not working. The logon form is returned.
broadreach 14 days ago

Try setting

LOGIN_URL = "http://share.baner.com//share/page/dologin"

Update with a cleaner version

import requests
from bs4 import BeautifulSoup

USERNAME = "lmcrory"
PASSWORD = "xxxxx"

LOGIN_URL     = "http://share.baner.com//share/page/dologin"
DASHBOARD_URL = "http://share.baner.com/share/page/user/lmcrory/dashboard"

def listLinks():
    s = requests.Session()

    # Perform login
    result = s.post(LOGIN_URL, data={
        "username": USERNAME, 
        "password": PASSWORD, 
    })

    # Scrape url
    html = s.get(DASHBOARD_URL).content
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.select("div.repo-list--repo > a"):
        print("{}\t{}".format(link.text, link.attrs["href"]))

listLinks()
Thanks for the tip @broadreach
tomtoump 13 days ago
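The question also asked for handling of invalid logins and invalid URLs, and for a plain list of links. One possible way to add that is sketched below, under a few assumptions: a failed login returns the login form again (as reported earlier in this thread), which is detected here by looking for a password field; `list_links`, `LoginFailed`, and `scan` are names made up for this sketch; and `session` is expected to behave like a `requests.Session`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PageScanner(HTMLParser):
    """Records every <a href> and whether a password input is present."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.hrefs.append(attr["href"])
        elif tag == "input" and (attr.get("type") or "").lower() == "password":
            self.has_password_field = True


def scan(html):
    scanner = _PageScanner()
    scanner.feed(html)
    return scanner


class LoginFailed(Exception):
    """Raised when the reply to the login POST still looks like a login form."""


def list_links(session, login_url, page_url, username, password):
    """Log in, fetch page_url, and return its links as absolute URLs.

    `session` is any object with requests.Session-style post() and get().
    """
    reply = session.post(login_url, data={"username": username, "password": password})
    reply.raise_for_status()
    # Heuristic (an assumption): a password field in the reply means the
    # login form was served again, i.e. the login did not succeed.
    if scan(reply.text).has_password_field:
        raise LoginFailed("login form was returned; check the credentials")

    page = session.get(page_url)
    page.raise_for_status()  # with requests, raises HTTPError on 4xx/5xx
    return [urljoin(page_url, href) for href in scan(page.text).hrefs]
```

Called with `requests.Session()` as the first argument, a bad password would surface as `LoginFailed` and a bad page URL as `requests.HTTPError`, both of which the caller can catch and report.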

Try setting enabled: false in the HTML file, as shown below:

Alfresco.constants.CSRF_POLICY = {
         enabled: false,
         cookie: "{token}",
         header: "{token}",
         parameter: "{token}",
         properties: {}
      };

Thanks :)

Thanks, but how do I send the policy information?
broadreach 14 days ago
I know your information is private, and I understand. Don't worry, we have 6 more days before the bounty expires. We can try various approaches, and I am hopeful that we will come up with a solution :). I am sure there is no major issue with your Python code, which is why I am putting more emphasis on the other side. First, set LOGIN_URL = "http://share.baner.com/share/page/dologin" and enabled: false in Alfresco.constants.CSRF_POLICY, and let me know what response you get from the Python code. That would help me debug it. Thank you
Codeword 14 days ago
As soon as you have done this, let me know, so that I can tell you the next thing I am thinking of.
Codeword 14 days ago
Where / how do I set enabled=false?
broadreach 14 days ago
In the HTML file you sent, find Alfresco.constants.CSRF_POLICY and set its enabled property to false. Also, don't forget to set the login URL to http://share.baner.com/share/page/dologin in the Python code, and let me know what the Python code returns as output. Thank you
Codeword 14 days ago
So did it work? Thank you
Codeword 14 days ago
Perfect! Thanks for pushing through this. It's so much tougher when you can't test. Exactly what I needed. Thanks again!
broadreach 14 days ago
Thank you for awarding me the bounty.
Codeword 14 days ago
Really? Stealing my solution?
tomtoump 13 days ago
Well deserved, I hope you got the tip too?
broadreach 13 days ago
The JS code has nothing to do with the issue.
tomtoump 13 days ago
@tomtoump Hey man, easy. I didn't know what the problem was, as I didn't have the files with me, so I tried every possible way to make it work. I didn't steal your solution.
Codeword 13 days ago
Suggesting the same as me, after me, is kind of stealing my solution.
tomtoump 13 days ago
Are you kidding me? In the era of the internet, coincidence is normal. Just do one thing: type your name into Google and you will probably find a domain for it, whatever you type. Moreover, I agree that the login URL suggestion is the same as mine, and it has to be, as there is only one login URL. But your thinking is not unique, and neither is mine. So the one thing that matters is how much effort you put into solving the issue. Guessing the submit URL right is not the issue; that is very basic. The issue is proving that your solution is 100% accurate, and also the effort and assistance you provide. Let me cite an example: the idea that gravitational waves exist is Einstein's, so why did the Nobel Prize go to the scientists who proved their existence and not to Einstein? Thank you
Codeword 13 days ago
LOL. Very basic, but you needed 10 hours, and me suggesting it first, before the copy-paste. Anyway, I don't have any problem with you, but with @broadreach for not selecting the solution he should have.
tomtoump 13 days ago
I don't have any problem with anyone either. But if you still think I copied, for your reference, check my account against yours, and also the hit ratio. Thank you
Codeword 13 days ago
Hey @Codeword, I'm @chlegou. We worked together on many projects. Can you please email me? I really need to talk to you, ASAP! My email is: nicolastsue@gmail.com. Looking forward to a quick answer from you. :)
Chlegou 8 days ago