Automate LinkedIn Scrape Content Feed (Personal Use)

Hi, looking for a way to scrape my content feed on LinkedIn (as I don't have time to read it all myself each day).

Need help from someone more proficient than me (I am not a developer I should say) in Beautiful Soup and Selenium.

Here is the workflow I want to implement:

  1. Log into LinkedIn
  2. Search content for keywords (provided as input argument)
  3. Scroll down the feed (how far should be a configurable option)
  4. Extract output to flat file.

I am stuck on 3: I can get it to page down, but it always scrapes the same first 3 posts. Could be something very trivial.

Open to tipping on this one, both for completing this workflow and for general advice and pointers.

Code here:

#! Python3
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, sys

# Get content keyword from command line if provided and form url search string
searchString = ''
if len(sys.argv) > 1:
    for x in sys.argv[1:]:
        searchString = (searchString + '%20'+x)
else:
    # If nothing provided in command line default to something
    searchString = "GPT3"

# Create urlString with search
urlString ='' + searchString + '&origin=GLOBAL_SEARCH_HEADER' 

# Try logging into LinkedIn
browser = webdriver.Edge(executable_path=r'C:<your path to>/msedgedriver.exe')

try:
    browser.get('<LinkedIn login page URL>')
    userElem = browser.find_element_by_id('username')
    userElem.send_keys('<your username>')
    pwdElem = browser.find_element_by_id('password')
    pwdElem.send_keys('<your password>')
    ### Alternative: loginElem = browser.find_element_by_xpath('//*[@id="app__container"]/main/div[2]/form/div[3]')
    loginElem = browser.find_element_by_xpath('//button[text()="Sign in"]').click()
except Exception:
    print("Error Logging in")

# Now call LinkedIn with our search string
browser.get(urlString)
bsScrape = bs4.BeautifulSoup(browser.page_source, 'html.parser')

# Try to scrape posts

#Setup number of times page down via END click

numPageDown = 10

for i in range(1,numPageDown):
    for post in bsScrape.select('span[class="break-words"]'):
        print(post.get_text(' ', strip=True))
        pageDown = browser.find_element_by_tag_name('html')
        pageDown.send_keys(Keys.END)


1 Solution


Hi @monkeydust.

As a non-developer person, you did a pretty good job :)

The reason your script always fetched the same first few posts is that you never gave BeautifulSoup a new page source. Your script pages down, which loads new posts and changes the page source, but BeautifulSoup still holds the old page source that was parsed before the loop started.
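
To illustrate the idea, here is a minimal sketch (not the linked script itself) of re-reading the page source after every scroll. The names `collect_posts` and `scrape_feed`, the `pause` parameter and the de-duplication step are assumptions made for this example; the CSS selector is the one from your code.

```python
import time


def collect_posts(get_html, scroll, extract, depth, pause=0.0):
    """Read the page, then scroll `depth` times, re-reading after each scroll."""
    seen = []
    for step in range(depth + 1):
        if step:  # no scroll before the very first read
            scroll()
            time.sleep(pause)  # give newly requested posts time to load
        # Key step: fetch the *current* HTML on every pass instead of
        # reusing a snapshot taken before scrolling.
        for text in extract(get_html()):
            if text not in seen:  # skip posts collected on an earlier pass
                seen.append(text)
    return seen


def scrape_feed(browser, depth, pause=2.0):
    """Hypothetical wiring of collect_posts to Selenium and BeautifulSoup."""
    # Imported here so collect_posts can be tried without these packages.
    import bs4
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    def extract(html):
        soup = bs4.BeautifulSoup(html, 'html.parser')
        return [post.get_text(' ', strip=True)
                for post in soup.select('span[class="break-words"]')]

    page = browser.find_element(By.TAG_NAME, 'html')
    return collect_posts(lambda: browser.page_source,
                         lambda: page.send_keys(Keys.END),
                         extract, depth, pause)
```

After logging in and opening the search page, `scrape_feed(browser, depth=5)` would return the post texts in the order they first appeared. The 2 second pause is a guess; slower connections may need more.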

You can find the complete script here.

A sample file generated from my feed here.

Also, you used the camelCase naming convention and I used snake_case, which is what Python's style guide (PEP 8) recommends for variable and function names. I'll leave it to you to choose whichever you prefer; optionally you can change the rest (or ask me to do it).

Example CLI usage

python3 --depth 5 --keywords keyword1 keyword2 keyword3

Thank you,

Hi Vladimir, Many thanks - makes sense what you say, should have figured that out. Thanks for changes also, things to learn from, I just put back in my login details and edge driver path and ran it (no command lines so goes to GPT3) and got this error: Any idea? Sam
monkeydust 2 years ago
Hi Sam! No problem. There is an improved version of the script here. Could you please check it out? The only thing that has changed is line 52: I've explicitly set the encoding to UTF-8. In case this doesn't work, could you tell me your operating system and Python version? Windows is known for these encoding problems, so we might go through a few attempts until we identify where the issue is. Thank you for your understanding.
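
For reference, a minimal sketch of what writing the output file with an explicit encoding can look like. The function name `save_posts` and the one-post-per-line layout are assumptions for illustration, not necessarily what the linked script does.

```python
def save_posts(posts, path):
    """Write one post per line, always encoded as UTF-8."""
    # Without encoding=..., open() falls back to the platform default,
    # which on Windows is often a legacy code page that cannot
    # represent emoji or some accented characters found in posts.
    with open(path, 'w', encoding='utf-8') as handle:
        for post in posts:
            handle.write(post + '\n')
```

Reading the file back should use the same `encoding='utf-8'` argument so the text round-trips unchanged.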
VladimirMikulic 2 years ago
Perfect, works on default. Few Qs
  1. How do I use the parser? When I type --keywords "happy friday" it still searches for GPT3.
  2. Also, when I change required=False to True it still goes off and searches GPT3 (the default), when I would expect an error saying please provide the keyword parameter.
  3. For my understanding, can you explain what "keywords = parser.parse_args().keywords or []" does, specifically the "or []"?
  4. As advice, is there anything further I should be doing to obfuscate this from LinkedIn? It's for personal use and uses a dummy account, but I want to minimize the chance of it getting picked up as an auto scraper.
Thanks again Sam
monkeydust 2 years ago
Hi Sam! I apologise for the mistake. The CLI validation code now goes at the top of the file so it is accessible to the other functions. Here is an improved version. Regarding your questions 1 & 2, I've set the --keywords flag to be optional. If it is not provided, the script won't throw an error; instead, it will use the default search string (GPT3 in our case). If you want to require keywords to be passed via the --keywords flag, you can set required to True and remove the default string code (it becomes redundant).
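
A rough sketch of the two behaviours, assuming an argparse setup along these lines. The helper names `build_parser` and `search_terms`, the default depth of 5 and the help texts are made up for this example and may differ from the linked script.

```python
import argparse

DEFAULT_SEARCH = 'GPT3'


def build_parser(require_keywords):
    parser = argparse.ArgumentParser(description='Scrape a LinkedIn content search')
    parser.add_argument('--depth', type=int, default=5,
                        help='how many times to scroll the feed')
    # nargs='+' collects one or more words into a list.
    parser.add_argument('--keywords', nargs='+', required=require_keywords,
                        help='search keywords')
    return parser


def search_terms(argv, require_keywords=False):
    """Return the search string for the given command-line arguments."""
    args = build_parser(require_keywords).parse_args(argv)
    if args.keywords:
        return ' '.join(args.keywords)
    # Only reachable when --keywords is optional and was left out.
    return DEFAULT_SEARCH
```

With `require_keywords=False`, `search_terms([])` falls back to `GPT3`. With `require_keywords=True`, argparse itself prints a usage error and exits when `--keywords` is missing, so the fallback branch is never reached.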
VladimirMikulic 2 years ago
Addressing the 3rd question: keywords = parser.parse_args().keywords or [] If no keywords are passed to the script, parser.parse_args().keywords is None. None is a special value in Python that indicates the absence of a value (i.e. nothing, empty). None becomes a problem in the if statement on the next line, which calls len(keywords): if keywords is None, len() raises a TypeError and the program crashes. To avoid this and make sure that keywords is always a list, we add or []. The line means: use the value of parser.parse_args().keywords if there is one; if it is None, use an empty list instead to prevent the error.
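
The same behaviour in isolation, as a small sketch. The parser here is a stand-in with a single optional `--keywords` flag, and `keywords_or_empty` is a name chosen for this example.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--keywords', nargs='+', required=False)


def keywords_or_empty(argv):
    """Return the parsed keywords, or an empty list when none were given."""
    # `x or y` evaluates to x when x is truthy and to y otherwise.
    # None is falsy, so a missing --keywords yields [] here.
    return parser.parse_args(argv).keywords or []


def safe_length(argv):
    # len(None) would raise TypeError; len([]) is simply 0.
    return len(keywords_or_empty(argv))
```

So `safe_length([])` is 0 rather than a crash, and `safe_length(['--keywords', 'a', 'b'])` is 2.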
VladimirMikulic 2 years ago
Addressing the 4th question: There are various ways to minimize detection. However, it is good to know that Selenium was built for testing websites/apps that the person owns. That's why it's a bit hard to conceal: it doesn't hide itself by default, nor is there an option for that. The best way to minimize the chance of getting caught is to behave like a human. In our case, we sign in and scroll a few times through the feed (with a delay of a few seconds), which is normal for a human. On the other hand, sending hundreds of job applications at once greatly increases your chance of getting caught. Changing some headers won't prevent you from being detected -> be as human as possible.
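
One simple way to make the delay between scrolls less mechanical is to randomise it. This is only a sketch: the helper name `human_pause` and the 2 to 5 second range are assumptions, and it does not guarantee that automation goes unnoticed.

```python
import random
import time


def human_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random interval and return the delay that was used."""
    delay = random.uniform(min_seconds, max_seconds)
    # `sleep` is a parameter so the function can be tried without waiting.
    sleep(delay)
    return delay
```

Calling `human_pause()` after each scroll replaces a fixed `time.sleep(...)` with a wait that varies from one scroll to the next.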
VladimirMikulic 2 years ago
This is a lot of information to process. Hopefully, this makes sense to you. Let me know if there is something else I should clarify/change. Thank you.
VladimirMikulic 2 years ago
It's also worth mentioning that LinkedIn provides an API that allows you to fetch information from your account the legal way :)
VladimirMikulic 2 years ago
Many thanks for your explanations, really helps. I did a quick try and it seems to work now. Will finish off testing tomorrow and close out. Regarding the LinkedIn API, if you click on that link you will see it says at the top that it's been deprecated. I did some further digging and they do have an API, but it's not easy to get access to and I couldn't find anything that would do what this script does with regard to pulling out content. I do this workflow daily, so now the output will instead be generated to a file and emailed to me; this part I will try to build myself :)
monkeydust 2 years ago
It's a pity that the LinkedIn API is deprecated. The new one seems to be all about building apps that integrate with the platform. Good luck in your future endeavours with Python. If you get tired of it, feel free to contact me :)
VladimirMikulic 2 years ago
Hi Vladimir, one more ask (will add to tip if you're ok?) - I am trying to get the poster's name, job title and age of post, delimited by commas. Struggling a bit with scraping the right.

Output would be e.g.

Joe Bloggs, Head of Acme Inc,

monkeydust 2 years ago
Hi @monkeydust. I've created the script according to your specification. You can find the script here and the example output here. Thanks.
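
As a rough sketch of the output step only (not the scraping, and not the linked script), Python's csv module can write the three fields per post. The field names `name`, `title` and `age` and the helper `write_post_rows` are hypothetical. Unlike a plain `', '.join(...)`, csv quotes any field that itself contains a comma, which job titles often do.

```python
import csv


def write_post_rows(rows, handle):
    """Write one 'name,title,age' line per post to an open text handle."""
    writer = csv.writer(handle)
    for row in rows:
        writer.writerow([row['name'], row['title'], row['age']])
```

When writing to a real file, open it with `newline=''` and `encoding='utf-8'`, as the csv documentation recommends, so line endings and special characters come out intact.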
VladimirMikulic 2 years ago
@monkeydust I've responded to your query. Please check your inbox. Thanks.
VladimirMikulic almost 2 years ago