Xpath query to get Tweet text

I'd like the xpath query to get a Tweet's text. I asked this question before but the winning solution no longer works:

https://bountify.co/get-tweet-text-xpath-query

Please provide an updated solution, in the same format as the previous solution. See that question for details.

Thanks!

Edit

Please see original question for solution format. The solution should work here: http://app.oraclize.it/home/test_query

The full Twitter page is not loaded by Oraclize (permalink-tweet-text is loaded dynamically, so Oraclize is not able to see it). That's why the original accepted solution no longer works.

However, I noticed that a div class="main-tweet-container" and a table with class main-tweet do show up, though, and there is a tweet-text class within.

This is pretty close: //div[contains(@class, 'main-tweet-container')]//text()

However, it also needs to select the div inside it that has the class tweet-text.

Are you sure it isn't a bug on Oraclize side?
gabrielsimoes 6 months ago
@gabrielsimoes It's possible, but entering other xpath queries on the Oraclize site works.
bevan 6 months ago
@bevan, i have updated my solution, with a working one, please check it
Chlegou 6 months ago
awarded to alixaxel
Tags
html
xpath


3 Solutions


It looks like a bug on the Oraclize side, since the same XPath works fine in other environments.

Since Twitter uses dynamic loading, the XPath may only match the elements present in the initial response and not those loaded afterwards. Why not parse the regular result and pick out only the second value with a middleware?

html(https://twitter.com/BarackObama/status/974015556675391488).xpath(//*[contains(@class, 'tweet-text')][1]//text())
Unfortunately it's not possible as a smart contract receives and must parse the result (cannot process much text). Is there another query that would work?
bevan 6 months ago
Hints: doesn't seem to be Oraclize bug, rather appears to be Twitter issue, because when I use Nokogiri and download that page (without JS), the "permalink-tweet-text" is not showing up (it was showing up before). However, I noticed that a div class="main-tweet-container" and a table with class main-tweet do show up, though. Maybe that helps?
bevan 6 months ago
This is pretty close, but also needs to select the first paragraph with tweet-text class: //div[contains(@class, 'main-tweet-container')]//text()
bevan 6 months ago
Winning solution

The following XPath should give you the result you want:

id("main-content")//*[@class="tweet-text"]//text()

Or if you want to make it more efficient:

id("main-content")//div[@class="tweet-text"]//text()

Output:

[
    "\n                  ",
    "  Just because I have more time to watch games doesn\u2019t mean my picks will be better, but here are my brackets this year: ",
    "go.obama.org/2018bracket",
    " ",
    "pic.twitter.com/gnNXw0Ysxr",
    "\n",
    "\n                "
]
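
The result above is a list of text nodes padded with layout whitespace. If an off-chain step is available to post-process it (an assumption; the smart contract itself may not be able to), a minimal Python sketch of collapsing the nodes into one string could look like this. `normalize_tweet_text` is a hypothetical helper name, and the list is copied from the output above.

```python
def normalize_tweet_text(nodes):
    """Join XPath text nodes and collapse runs of whitespace to single spaces."""
    return " ".join("".join(nodes).split())


# Text nodes copied from the output shown above.
nodes = [
    "\n                  ",
    "  Just because I have more time to watch games doesn\u2019t mean my picks will be better, but here are my brackets this year: ",
    "go.obama.org/2018bracket",
    " ",
    "pic.twitter.com/gnNXw0Ysxr",
    "\n",
    "\n                ",
]

tweet = normalize_tweet_text(nodes)
print(tweet)
```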

Explanation

Basically Twitter is serving different content to Oraclize. To target the main tweet and ignore replies, we use:

  • id("main-content")

Then we get the tweet-text div nested inside it (any descendant, not only a direct child):

  • div[@class="tweet-text"]

Since we are only targeting the main tweet, this will ignore all the replies, so we don't need to access by index.

Lastly, we grab all the text nodes (different paragraphs) from the main tweet content:

  • text()
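
The three steps above can be traced in a rough Python sketch using the standard library's ElementTree. This is only an approximation: ElementTree does not support `id()` or `text()`, so an attribute match and `itertext()` stand in for them, and the `PAGE` markup is a hypothetical, simplified stand-in for what Twitter serves.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified, well-formed stand-in for the page markup.
PAGE = """
<html><body>
  <div id="main-content">
    <div class="tweet-text">Main tweet <a href="#">example.org/link</a></div>
  </div>
  <div id="replies">
    <div class="tweet-text">A reply</div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Step 1: id("main-content") -> the element whose id attribute is "main-content".
main = root.find(".//*[@id='main-content']")

# Step 2: //div[@class="tweet-text"] -> descendant divs with exactly that class value.
tweet_divs = main.findall(".//div[@class='tweet-text']")

# Step 3: //text() -> every text node under each matched div, in document order.
text_nodes = [t for div in tweet_divs for t in div.itertext()]
print(text_nodes)
```

Because the search starts from the main-content element, the reply's tweet-text div is never visited, which is why no index is needed.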

Hi Alixaxel, this one doesn't seem to work. Could you provide a query that works on the Oraclize xpath tester? I have made that requirement more clear. Thanks. Also see comments on @enderdba's solution.
bevan 6 months ago

Hi there,

After reviewing the source code of the Twitter page, I ran a few tests online and noticed that all tweets use the same classes on their DOM elements. Here is an example:

<div class="js-tweet-text-container">
  <p class="TweetTextSize js-tweet-text tweet-text" lang="en" data-aria-label-part="0">
  TWEET_TEXT_HERE
  </p>
</div>

The exception is the main tweet, which appears highlighted and has an extra class, .TweetTextSize--jumbo, on its text element. It follows that this class is what now marks the main tweet, not .permalink-tweet-container as before (per the last solved bounty).

With that, the main tweet has HTML similar to this:

<div class="js-tweet-text-container">
  <p class="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text" lang="en" data-aria-label-part="0">
  TWEET_TEXT_HERE
  </p>
</div>

With this XPath I was able to get the main tweet, since .TweetTextSize--jumbo is unique on the page.

//*[contains(@class, 'TweetTextSize--jumbo')]//text()

So I think the final solution will look like this:

html(
https://twitter.com/BarackObama/status/974015556675391488
).xpath(
//*[contains(@class, 'TweetTextSize--jumbo')]//text()
)
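
One thing worth keeping in mind is that `contains(@class, ...)` is a plain substring test on the whole attribute value, not a match on individual class names. The Python sketch below illustrates the difference on markup modeled on the two snippets above; `has_class` and `class_contains` are hypothetical helpers, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modeled on the two snippets above.
PAGE = """
<div>
  <p class="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text">Main tweet</p>
  <p class="TweetTextSize js-tweet-text tweet-text">Reply</p>
</div>
"""


def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()


def class_contains(element, fragment):
    """Mimics XPath contains(@class, fragment): a plain substring test."""
    return fragment in element.get("class", "")


root = ET.fromstring(PAGE)
paragraphs = list(root.iter("p"))

# Only the main tweet carries the jumbo class token.
jumbo = [p.text for p in paragraphs if has_class(p, "TweetTextSize--jumbo")]

# Substring matching is looser: "TweetTextSize" occurs in both class strings.
loose = [p.text for p in paragraphs if class_contains(p, "TweetTextSize")]
print(jumbo, loose)
```

The jumbo query is safe here because no other class on the page contains that substring; a shorter fragment such as 'tweet-text' would also match 'js-tweet-text'.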

IMPORTANT: since XPath testers parse the page as XML, I noticed that the query fails when the parser finds &nbsp; in the page or tweet (it is added for links in tweets). This might be a problem, so you will need to handle it. By the same logic, your last expression, //*[contains(@class, 'tweet-text')][1]//text(), should work fine.

NB: the tests were made online here: http://www.xpathtester.com/xpath
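
The &nbsp; failure can be reproduced with any strict XML parser. The sketch below uses Python's ElementTree on a hypothetical one-line snippet; whether Oraclize's own parser behaves the same way is not something this example can confirm.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet with the HTML entity that Twitter adds around links.
snippet = '<p class="tweet-text">see&nbsp;<a href="#">link</a></p>'

try:
    ET.fromstring(snippet)
    parsed_raw = True
except ET.ParseError:
    # &nbsp; is an HTML entity, not one of XML's five predefined entities.
    parsed_raw = False

# Workaround: swap the named entity for its numeric form before parsing.
fixed = snippet.replace("&nbsp;", "&#160;")
text = "".join(ET.fromstring(fixed).itertext())
print(parsed_raw, repr(text))
```

Replacing only this one entity keeps the rest of the markup untouched, which is safer than unescaping every entity before parsing.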

UPDATE:

After some experimenting, I was able to get it right using this XPath:

html(https://twitter.com/BarackObama/status/974015556675391488).xpath(id("main-content")//*[@class="tweet-text"][1]//text())

I also tried it with another tweet, and it worked fine.

Thanks for looking into this Chlegou. Could you provide a query that works on the Oraclize xpath tester (see format of solution on original question)? That is a requirement that I have made more clear.
bevan 6 months ago
@bevan, i have updated my solution, with a working one, please check it
Chlegou 6 months ago