Extracting data from a web site using PHP

Hello

I need a PHP script to extract data from this site:
http://www.asianodds.com/Italy__Serie_A.html

The data should be written to the file d:/data/data.txt

If the page looked exactly like this:
https://ibb.co/12xbpCC

the data should be extracted in this way:

001|07/12/2018|19:30|Juventus:Inter Milan|-1:+1|2.090R:1.840B|-1:+1|1.810:2.090|2.25B:2.75|2.010:1.890|1.970:1.930

002|08/12/2018|19:30|Napoli:Frosinone|-2.25R:+2.25B|-2.25:+2.25|-1:+1|1.910:1.990|3.25B:3.5|1.900:2.000|2.020:1.880

... and so on

Notes:
001 is a sequential number.
If a value is shown in red, an R should be appended to the stored value.
If a value is shown in blue, a B should be appended to the stored value.
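
A minimal sketch of the output format described in the notes, assuming the colour of each value has already been detected. The helper names markValue and formatRow are made up for illustration and are not part of any solution in this thread. The code avoids PHP 7-only syntax so it should also run on PHP 5.6.

```php
<?php
// Hypothetical helper: append "R" for red values and "B" for blue ones.
function markValue($value, $colour)
{
    $suffixes = array('red' => 'R', 'blue' => 'B');
    $key = is_string($colour) ? strtolower($colour) : '';
    return $value . (isset($suffixes[$key]) ? $suffixes[$key] : '');
}

// Hypothetical helper: prefix the fields with a zero-padded sequence number.
function formatRow($sequence, array $fields)
{
    return sprintf('%03d', $sequence) . '|' . implode('|', $fields);
}

$line = formatRow(1, array(
    '07/12/2018',
    '19:30',
    'Juventus:Inter Milan',
    markValue('-1', null) . ':' . markValue('+1', null),
    markValue('2.090', 'red') . ':' . markValue('1.840', 'blue'),
));

echo $line, PHP_EOL;
```

Only the first few columns of the requested line are built here; the remaining columns would follow the same pattern.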

Thank you

Hi @graz68, did you try the solution I provided? Did it work? Thanks.
Codeword 1 month ago
awarded to Chlegou
Tags
PHP


1 Solution

Winning solution

Hi there, the bounty was challenging, but here is my solution! :D

I used the phpQuery library, which makes manipulating the content easy and fun (you can easily play with the script).

The script does the following:

- extracts (scrapes) the content of the table.

- is robust enough to tolerate many changes to the page without breaking.

- saves the result in a file named with the date and time (so you don't lose previously generated data).

- also prints the result in the browser.
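
The table-extraction step can be sketched as follows. This is not the author's code: it uses PHP's built-in DOMDocument and DOMXPath instead of phpQuery, and the HTML snippet and the id "odds" are invented for the example, since the real page markup is not shown in this thread.

```php
<?php
// Made-up sample markup; the real page layout is an assumption.
$html = '<table id="odds">'
      . '<tr><td>Juventus</td><td>Inter Milan</td><td><font color="red">2.090</font></td></tr>'
      . '<tr><td>Napoli</td><td>Frosinone</td><td>1.910</td></tr>'
      . '</table>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = array();

// Walk every row of the table and join the trimmed cell texts with "|".
foreach ($xpath->query('//table[@id="odds"]//tr') as $tr) {
    $cells = array();
    foreach ($xpath->query('./td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = implode('|', $cells);
}

print_r($rows);
```

Detecting the red or blue colour would need a further look at each cell's child elements or attributes, which depends on how the page actually marks colours.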

My code is available here: https://github.com/chlegou/asianOdds_scrapping

You can see many example results here: https://github.com/chlegou/asianOdds_scrapping/tree/master/data

Here is an example of the printed result:

Output:

001|08/12/2018|14:00|Napoli:Frosinone|-2.25R:+2.25B|2.110:1.840|-2.5:+2.5|2.090:1.810|3.25B:3.5|1.880:2.020|2.070:1.830
002|08/12/2018|17:00|Cagliari:AS Roma|+0.5B:-0.5R|1.880:2.060|+0.75:-0.75|2.050:1.850|2.75:2.75|2.060R:1.840B|1.980:1.920
003|08/12/2018|19:30|Lazio:Sampdoria|-1.25:+1.25|1.960B:1.980R|-1.25:+1.25|1.980:1.920|3:3|2.060B:1.840R|2.110:1.800
004|09/12/2018|11:30|Sassuolo:Fiorentina|0:0|2.060R:1.870B|0:0|1.950:1.950|2.5:2.5|2.060R:1.840B|1.990:1.910
005|09/12/2018|14:00|Empoli:Bologna|-0.5:+0.5|2.020B:1.910R|-0.5:+0.5|2.030:1.870|2.5:2.5|1.910B:1.990R|2.060:1.840
006|09/12/2018|14:00|Parma:Chievo|-0.5R:+0.5B|2.090:1.840|-0.75:+0.75|2.020:1.880|2.25B:2.5|1.920:1.980|2.090:1.810
007|09/12/2018|14:00|Udinese:Atalanta|+0.5R:-0.5B|1.870:2.060|+0.25:-0.25|1.800:2.110|2.5:2.5|1.830B:2.070R|1.910:1.990
008|09/12/2018|17:00|Genoa:Spal 2013|-0.5:+0.5|2.030R:1.900B|-0.5:+0.5|1.940:1.960|2.25B:2.5|1.910:1.990|2.130:1.780
009|09/12/2018|19:30|AC Milan:Torino|-0.5R:+0.5B|1.920:2.010|-0.75:+0.75|1.900:2.000|2.5:2.5|1.870B:2.030R|1.900:2.000
010|15/12/2018|17:00|Inter Milan:Udinese|-1.25:+1.25|1.870B:2.030R|-1.25:+1.25|1.980:1.920|2.75:2.75|1.970R:1.930B|1.850:2.050
011|15/12/2018|19:30|Torino:Juventus|+1:-1|1.980B:1.920R|+1:-1|2.000:1.900|2.5:2.5|2.070R:1.830B|2.050:1.850
012|16/12/2018|11:30|Spal 2013:Chievo|-0.5:+0.5|2.050:1.850|-0.5:+0.5|2.050:1.850|2.25:2.25|1.910R:1.990B|1.850:2.050
013|16/12/2018|14:00|Fiorentina:Empoli|-0.75:+0.75|2.050:1.850|-0.75:+0.75|2.050:1.850|2.75:2.75|2.050:1.850|2.050:1.850
014|16/12/2018|14:00|Frosinone:Sassuolo|+0.5:-0.5|2.020R:1.880B|+0.5:-0.5|1.850:2.050|2.5:2.5|2.000:1.900|2.000:1.900
015|16/12/2018|14:00|Sampdoria:Parma|-0.5:+0.5|1.880B:2.020R|-0.5:+0.5|2.000:1.900|2.5:2.5|2.000:1.900|2.000:1.900
016|16/12/2018|17:00|Cagliari:Napoli|+1.25:-1.25|2.110R:1.800B|+1.25:-1.25|2.050:1.850|2.75:2.75|2.050R:1.850B|2.000:1.900
017|16/12/2018|19:30|AS Roma:Genoa|-1.25:+1.25|1.790B:2.120R|-1.25:+1.25|1.950:1.950|2.75:2.75|1.770B:2.140R|1.950:1.950
018|17/12/2018|19:30|Atalanta:Lazio|-0.25:+0.25|2.050:1.850|-0.25:+0.25|2.050:1.850|2.5:2.5|1.790B:2.120R|1.900:2.000
019|18/12/2018|19:30|Bologna:AC Milan|+0.75:-0.75|2.030R:1.870B|+0.75:-0.75|1.900:2.000|2.5:2.5|2.000R:1.900B|1.950:1.950
Thank you!! I have a problem, however. I get a fatal uncaught error: Call to a member function find() on boolean. Any idea, please?
graz68 1 month ago
Which PHP version do you have? The script requires PHP 5. Could you send me the full error message?
Chlegou 1 month ago
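
The message "Call to a member function find() on boolean" means that find() was called on a value that is false rather than an object. The thread does not say which call returned false. A generic guard, sketched here with a hypothetical fetchOrFail wrapper around file_get_contents, turns such a silent false into an explicit error at the point of failure.

```php
<?php
// Hypothetical wrapper: return the content, or throw instead of handing back false.
function fetchOrFail($source)
{
    // file_get_contents returns false on failure; "@" hides the accompanying warning.
    $body = @file_get_contents($source);
    if ($body === false) {
        throw new RuntimeException("Could not read: $source");
    }
    return $body;
}

try {
    $html = fetchOrFail('/no/such/file.html');
} catch (RuntimeException $e) {
    $html = null;
    echo $e->getMessage(), PHP_EOL;
}
```

The same check applies to any parsing call that can return false: test the result before calling methods on it.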
Thank you!! I have a problem, however. I get a fatal uncaught error: Call to a member function find() on boolean. Any idea, please? I have PHP Designer 8 and run the script from it ... please wait while I check which PHP version it is.
graz68 1 month ago
Yes, I was using PHP 7; now I have set PHP Designer 8 to use PHP 5.6. It now runs without errors, but the file file.txt is not populated.
graz68 1 month ago
I have this now PHP 5.6.12 (cli) (built: Aug 6 2015 12:06:17) Copyright (c) 1997-2015 The PHP Group Zend Engine v2.6.0, Copyright (c) 1998-2015 Zend Technologies
graz68 1 month ago
OK, I found that the script is saving the file in C:\Users\graziano\AppData\Local\Temp. What do I have to do to save the file in d:/data/file.txt?
graz68 1 month ago
Forget it, it works now. Thank you for the excellent work.
graz68 1 month ago
First, thanks for awarding me the bounty; it was challenging, fun, and rewarding in the end. :p Glad it worked fine in PHP 5; I might need to test it in PHP 7 as well. I made it save the file in the data directory, in the same folder as the index file. You can change the path as you want; in your case, 'd:/data'.
Chlegou 1 month ago
really excellent work !!
graz68 1 month ago
The $Path variable is on line 6 of the index file (I'm limited right now, on mobile away from home).
Chlegou 1 month ago
all works now, thank you!!!!
graz68 1 month ago
Thanks! It was my first time writing a web scraping script, that's why I said challenging :p If you need any improvements, let me know here or by email: nicolastsue@gmail.com
Chlegou 1 month ago
ok thank you! .
graz68 1 month ago
Good solution. Congrats @chlegou.
Codeword 1 month ago
@codeword, thanks mate, it's been a long time since we worked on bounties together.
Chlegou 1 month ago
Yeah, it's been a long time :) Actually, I have been working on two of my projects, so I couldn't try bounties. One of them is about to finish and I have not started the other one yet.
Codeword 1 month ago
All the best with your projects ;)
Chlegou 1 month ago
I found a bug when using the page http://www.asianodds.com/next_200_games.asp. Sometimes I see errors like this:
038|08/12/2018|12:00|Liverpool + Darmstadt 98:Bournemouth + Ingolstadt 04|-1.25:+1.25|2.070R:1.770B|-1.25:+1.25|1.980:1.860|5.5:5.5|1.940:1.880|1.940:1.880
Note this part: Liverpool + Darmstadt 98:Bournemouth + Ingolstadt 04
It's wrong. Can you see what the problem is, please? Thank you.
graz68 1 month ago
I'm away from home; I will check when I'm back.
Chlegou 1 month ago
@graz68 I just tested it and couldn't see that record in the scraped file. Hahaha, now I know why: its time has already passed :p (I didn't look at the date and time in the dataset, that was dumb of me lol :p). OK, could you please identify the issue with the result set you gave me? (You didn't say what it was in your last comment.) Is it the team names, or a record that wasn't showing in the table but was scraped into the result set, or two different games merged together?
If this happens again, please make sure to save the page and send it to me, so I can run tests and identify the bug.
Chlegou 1 month ago
Hello, indeed I didn't save the page that caused it. If it happens again I will take note, thank you.
graz68 1 month ago
Hello, I have seen the problem again; it happens often if I use $url = "http://www.asianodds.com/next_200_games.asp"; I have this situation: https://ibb.co/QQhRM01 and this is the code: https://gist.github.com/graz68a/47be16b0dc4b164171a097fedce1a4c2 The result is this:
048|10/12/2018|19:30|VfL Bochum + Salernitana:St. Pauli + Brescia|-0.5:+0.5|1.940:1.900|-0.5:+0.5|1.940:1.900|4.75:4.75|1.870:1.950|1.870:1.950
049|10/12/2018|19:30|VfL Bochum + Feirense:St. Pauli + Maritimo|-0.5:+0.5|1.940:1.900|-0.5:+0.5|1.940:1.900|4.5:4.5|1.800:2.020|1.800:2.020
Thank you.
graz68 1 month ago
Another, different problem here. Result: https://gist.github.com/graz68a/233ff9d60d581e1fd882ee33a6311d25 Source: https://gist.github.com/graz68a/cbb29f8d2cbab16aede8b04f01bdbe34 Dates and matches are wrong.
graz68 1 month ago
Nice, having two different examples is very helpful. I will look at them and get back to you.
Chlegou 1 month ago
You didn't say what the bug is, so I'm just guessing. For the second bug, I assume the issue is in the date, and I will fix it. Neither the execution results nor the code are the same, so what I will do is compare the results I'm getting with the table I have now. I will wait until you reply to this comment, and please consider contacting me by email; we may want to dive deeper into it. Email: nicolastsue@gmail.com
Chlegou 1 month ago
Thank you for fixing the date. In the first case I reported, using $url = "http://www.asianodds.com/next_200_games.asp"; sometimes there are rows like this: 048|10/12/2018|19:30|VfL Bochum + Salernitana:St. Pauli + Brescia| . These are 4 different football clubs instead of 2 in the same row. It should be Salernitana vs Brescia and VfL Bochum vs St. Pauli.
graz68 1 month ago
Yeah, I noticed that at first look, but when executing I didn't get that result! That's why I gave you my email.
Chlegou 1 month ago
OK, issue #2 is fixed; I have updated the git repo and the code is available there. For issue #1, there are no errors in the code you gave me; here is the result for the gist you gave me: https://gist.github.com/chlegou/9436d50c646c43731040ec3052a205ea I also tested online, and bingo, I found the issue! The script is working correctly, with no errors: there are duplicated names in that page. You can see that in this screenshot: https://ibb.co/ZYShWSh This is the source taken at the exact same moment: https://gist.github.com/chlegou/0277a840e266d33f993539602f97eddc To sum up, issue #1 wasn't a bug (since fantasy matches have duplicated team names) and issue #2 is fixed and available in the git repo. I have also tested the code in PHP/7.2.11 and it worked fine.
Chlegou 1 month ago
Thank you, it works perfectly now. I hope to win some money soon so I can thank you with more money too.
graz68 1 month ago
I know, I am a nuisance. If you still have patience, there are still issues. Right now I have this in asianodds: https://gist.github.com/graz68a/c0686a865e1228bcba8a6c14247f91fd Using $url = "http://www.asianodds.com/next_50_games.asp"; the result is the following: https://gist.github.com/graz68a/0161577d8c0684978184cc94c72525aa All the records with a " + " between the football club names are wrong, because there are 4 football clubs instead of two.
graz68 1 month ago
So do you want to skip any records that hold 4 teams instead of two? (According to what I see, that's the various matches section.)
Chlegou 1 month ago
No, I will explain, one moment please.
graz68 1 month ago
I will try that and get back to you.
Chlegou 1 month ago
There is something really abnormal! The result you gave doesn't match the source at all! I ran it and got something different! By the way, this isn't the first time: all the results you give mismatch the input you have, while the input you give matches the result I'm getting! Something must be abnormal, that's all I can say. I suggest we do a TeamViewer call so we can figure out what is going on. I'm thinking this could be a bad configuration.
Chlegou 1 month ago
OK, I understand it works for you. Please let me try a different PHP version in that case.
graz68 1 month ago
Please can you tell me which PHP version you use, and whether there are special requirements to run your script?
graz68 1 month ago
The code runs perfectly in PHP/5.6.35 and PHP/7.2.11, with no errors, outputs matching inputs, and the wanted results. The problem is that your output doesn't match your input! That's why I wanted to check.
Chlegou 1 month ago
No, only one question: are you using $url = "http://www.asianodds.com/next_50_games.asp"; to test? Because the problem seems to happen only with $url = "http://www.asianodds.com/next_50_games.asp";
graz68 1 month ago
Within the script, I can test directly on a web page or on a local file. When you give me a source file, I use it as input and test on it directly.
Chlegou 1 month ago
I believe there is something abnormal, that's why I'm trying to see what is going on. Seriously, I can't believe this mysterious issue. The problem is, even your input and output don't match!
Chlegou 1 month ago
Please compare the static source file and your output manually, so you might see a difference.
Chlegou 1 month ago
I do not understand what you mean by "static source file". Please can you show me your output using $url = "http://www.asianodds.com/next_50_games.asp" and tell me when (at what time) you executed it?
graz68 1 month ago
The static source file is the HTML code you pass to me via gist. I save it to a local file and use it as input instead of the live web page, and extract the output from it. The source code you gave doesn't match the output you gave either. That's why I'm confused.
Chlegou 1 month ago
It seems to be a PHP version problem, because I have now tried PHP 5.6.12 and received a different output. Even now it's not quite perfect: some matches are missing and there are some matches which should not be in the page :-O, really weird. I will try using exactly your PHP/5.6.35 now.
graz68 1 month ago
I was thinking, maybe we could do a TeamViewer session so I could check properly, if you accept that. That's what we really need.
Chlegou 1 month ago
Using 5.6.35 now. The output is correct; however, I see some matches which are not in the page, for example 050|12/12/2018|15:00|Benfica U19:AEK Athens U19|-3.25:+3.25|1.720:2.130|-3.25:+3.25|1.720:2.130|4.25:4.25|1.770:2.050|1.770:2.050 But if you open http://www.asianodds.com/next_50_games.asp now, the latest match is at 14:00 and there is no Benfica U19:AEK Athens U19 match :-O
graz68 1 month ago
I understand why it happens: it seems that if I use $url = "http://www.asianodds.com/next_50_games.asp" it loads this page http://www.asianodds.com/default.asp?mode=1bet , which contains different matches.
graz68 1 month ago
OK, at this point all seems to be fine; it was a PHP version issue.
graz68 1 month ago
Yeah, but this is abnormal. I believe you are right and the script is right; it's related to either the server cache or cURL in PHP. I might look into it, but to do that I need a careful check in both versions... Anyway, if there is any news I will let you know. I will investigate it.
Chlegou 1 month ago
Thank you!
graz68 1 month ago
As I promised, I investigated the issue. According to this answer: https://stackoverflow.com/a/17741493/4771750 , results may differ between browsers and cURL. I didn't find anything about cURL behaving differently across PHP versions, but who knows... Anyway, I have dropped cURL and now use DOMDocument to fetch the URL content; maybe it will return the same result. Please test on any PHP versions you have and let me know. The code is, as always, available on GitHub. Use diffchecker to look for differences in the results: https://www.diffchecker.com/
Chlegou 1 month ago
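
If the server answers scripts differently from browsers, one thing to try is sending browser-like request headers. This is only a sketch under that assumption: whether the site looks at these headers is unknown, and the helper name browserLikeContext and the user-agent string are made up. It uses PHP's stream context options, which file_get_contents accepts as its third argument.

```php
<?php
// Hypothetical helper: build a stream context that sends browser-like headers.
function browserLikeContext($userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Accept: text/html\r\nAccept-Language: en\r\n",
            'timeout'    => 15,
        ),
    ));
}

$context = browserLikeContext('Mozilla/5.0 (compatible; example)');

// The actual request is left commented out so the sketch runs without network access:
// $html = file_get_contents($url, false, $context);

$options = stream_context_get_options($context);
echo $options['http']['user_agent'], PHP_EOL;
```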
Thank you! I tested your new version using http://www.asianodds.com/next_200_games.asp and found errors in the output, for example 037|15/12/2018|12:00|St. Pauli + Union Berlin:Greuther Furth + VfL Bochum|-1.25:+1.25|1.900:1.940|-1.25:+1.25|1.900:1.940|4.75:4.75|1.720:2.110|1.720:2.110 However, every time I use exactly this URL, http://www.asianodds.com/default.asp?mode=188bet (which is the link I really need), the output is always correct, with cURL and without cURL. It seems that without ?mode=188bet there is always some error in the output (for me). Again, that's not a problem for me, because the link I need is http://www.asianodds.com/default.asp?mode=188bet and it always works.
graz68 1 month ago
This problem is really starting to bother me!
Chlegou 1 month ago
IMHO it is perfect for me now, please don't worry.
graz68 1 month ago
OK, I had been trying to add this for 2 days, and I just did. The problem was knowing whether the script is wrong, or whether it is right but getting a different input from what we see in browsers. So I got the idea: since we have the web page content from the server, why not save it to a file?! I was really dumb not to think of that from the first day. That way we can compare what the script gets as input with what it produces as output. It also tells us whether we are getting different results from what we see in the browser. How dumb I was! :p Anyway, the newly committed script also saves the source file, so from now on you get two files, data_xxxx.txt and data_xxxx_source.html, and we can properly debug the script and see whether it is really making mistakes.
Chlegou 1 month ago
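
The idea of saving the raw input next to the output can be sketched like this. It is not the committed script: the function name saveRun is hypothetical, and the file naming only imitates the data..._source.html pattern mentioned in this thread.

```php
<?php
// Hypothetical helper: write the scraped lines and the HTML they came from,
// sharing one timestamp so the two files can be compared later.
function saveRun($dir, $html, array $lines)
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    $base = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . 'data' . date('YmdHis');
    file_put_contents($base . '.txt', implode(PHP_EOL, $lines) . PHP_EOL);
    file_put_contents($base . '_source.html', $html);
    return $base;
}

// Demo in the system temp directory; in the thread the target would be 'd:/data'.
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'asianodds_demo';
$base = saveRun($dir, '<html><body>sample</body></html>', array(
    '001|08/12/2018|14:00|Napoli:Frosinone',
));
echo $base, PHP_EOL;
```

With both files on disk, a mismatch between the browser view and the script output can be traced to either the fetched HTML or the parsing step.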
Thank you, sorry for the delay, testing now.
graz68 1 month ago
I described the problem here: https://gist.github.com/graz68a/fd827c9d87e1633438a117f154c9e092 , thank you. I do not know why, but the script works only with this link, http://www.asianodds.com/default.asp?mode=188bet ; with other links it does not work.
graz68 1 month ago
I didn't notice your last comment, sorry for the delay. ... From your input, I see that the script is working perfectly! These records are actually in the web page! See: https://ibb.co/g9BMqgh . I already told you that there is a fantasy matches section containing 4 clubs and sometimes more. If you don't want that section to be scraped, I can work on that.
Chlegou 1 month ago
Regarding the records with 4 teams, could there be a problem when the script creates the data20181219085406_source.html file? I am asking because if I open the page URL in the browser I have never seen those records with 4 teams, never, while I very often see this problem in the output.
graz68 1 month ago
Yes, you could do that. And that's what I was thinking from the beginning, that you were missing the fantasy matches section (I talked about that in comment #22). To sum up, you could use a static file as input. Also, I could modify the script so that it avoids the whole fantasy matches section and ignores those records. You only have to change this to make it happen: https://github.com/chlegou/asianOdds_scrapping/blob/master/index.php#L42-L44 (you might need to comment out the fetch and save of the live source, since it's not needed).
Chlegou 1 month ago
It seems that if Asianodds recognizes a scraper, it adds this section "Various : Fantasy matches" ...
graz68 1 month ago
It adds several of these "Various : Fantasy matches" sections. If you could add code to ignore the matches listed in "Various : Fantasy matches", that would be great, or if you could find a way to keep Asianodds from recognizing your script as a scraper.
graz68 1 month ago
The second option isn't guaranteed, but we could easily ignore the fantasy matches, as I said before. I will add that later and let you know.
Chlegou 1 month ago
Thank you, the first option should be enough. I'm checking and comparing sources and, if I am not wrong, Asianodds is not messing with other data; it's only adding those fantasy sections.
graz68 1 month ago
I know I took too long to respond, but I managed to add the change to skip the "various fantasy matches" section: https://github.com/chlegou/asianOdds_scrapping/blob/master/index.php#L108-L111 This code does it; un-commenting it will scrape them. Hope this helps you out.
Chlegou 26 days ago
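
The linked change skips the fantasy section inside the scraper itself. A different, cruder approach is to filter the finished output, dropping any line whose teams column contains " + ". The sketch below assumes the pipe-separated format used in this thread, and the function name dropCombinedMatches is made up.

```php
<?php
// Hypothetical filter: keep only lines whose teams column (4th field) has no " + ".
function dropCombinedMatches(array $lines)
{
    $kept = array();
    foreach ($lines as $line) {
        $fields = explode('|', $line);
        $teams = isset($fields[3]) ? $fields[3] : '';
        if (strpos($teams, ' + ') === false) {
            $kept[] = $line;
        }
    }
    return $kept;
}

$lines = array(
    '001|08/12/2018|14:00|Napoli:Frosinone|-2.25R:+2.25B|2.110:1.840',
    '048|10/12/2018|19:30|VfL Bochum + Salernitana:St. Pauli + Brescia|-0.5:+0.5|1.940:1.900',
);

print_r(dropCombinedMatches($lines));
```

Filtering after the fact leaves gaps in the sequence numbers, so renumbering would be needed if the numbers must stay consecutive.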
Thank you!
graz68 25 days ago