I generally grab all of the text on the page and then sort through it. Once I have worked out what I want to keep, I discard the raw data.
# $target is set to whatever web page URL I am looking to scrape. I can update it so that if the page offers a Next or page 2 link then it cycles around again.
$target = "https://<web site url>";
# Curl needs a User Agent so that it can emulate the end user browser. I often have to play around with different User Agents before I find one that works reliably.
$user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36';
# If going to use Tor then need to define a proxy and port. I tend not to use this.
$proxy = "127.0.0.1";
$port = "9050";
# Get a cookie file and set up the web scraping request. tempnam() creates a
# unique file in the given directory, using the second argument as a prefix.
$ckfile = tempnam ("/home/pgroom", "targetwebpagecookie.txt");
$ch = curl_init($target);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_URL, $target);
$timeout = 30; # connection timeout in seconds
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_ENCODING, "");
# The next line is for debugging only and spits out the headers so you can look at the handshakes.
# curl_setopt($ch, CURLOPT_VERBOSE, TRUE);
# The next two lines turn off certificate checking for https web sites, which is not secure in any way. If I am productionising something then I tighten these up.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
# This section is only required for Tor, which is why it is commented out
# curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME); # the constant's value is 7
# curl_setopt($ch, CURLOPT_PROXY, $proxy.':'.$port);
# This is where the actual web scraping request is made and errors (if any) are flagged.
$initpage = curl_exec($ch);
$curl_errno = curl_errno($ch);
$curl_error = curl_error($ch);
curl_close($ch);
# If there are any errors then they are displayed here.
if ($curl_errno > 0)
{
print "\nCurl error no: ".$curl_errno;
print "\nCurl error: ".$curl_error;
}
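As I said at the start, I grab all of the text on the page and sort through it afterwards. A minimal sketch of that sorting step, using PHP's built-in DOMDocument (the extract_links helper is just my illustration here, not part of the script above):

```php
<?php
# Pull every link (href plus anchor text) out of raw HTML.
# Uses DOMDocument from the standard dom extension; warnings are
# suppressed because scraped pages are rarely valid markup.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),  # may be relative to the page URL
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

# Example: run it over whatever curl_exec() returned.
# $links = extract_links($initpage);
```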
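And for the Next / page 2 idea mentioned against $target: one way to sketch it is a helper that looks for a next-page link in the HTML just fetched, plus a loop around the fetch. The find_next_url name and the anchor-text match on "Next" are my assumptions; the exact pattern needs adjusting per site:

```php
<?php
# Return the URL of the next page, or null when there is not one.
# Here I just look for an anchor whose text is "Next"; note the
# returned href may be relative and need resolving against $target.
function find_next_url(string $html): ?string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {
        if (strcasecmp(trim($a->textContent), 'Next') === 0) {
            return $a->getAttribute('href');
        }
    }
    return null;
}

# Sketch of the paging loop around the curl code above:
# while ($next = find_next_url($initpage)) {
#     curl_setopt($ch, CURLOPT_URL, $next);
#     $initpage = curl_exec($ch);
# }
```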
And that is really all there is to it; I hope the above proves useful. I have also written posts on Docker, Ansible and Selenium & Python.
Thanks
Pete