Using php and curl to scrape a web page

I often use web scraping code like the example snippet below when looking at technology as part of an IT Assessment, Due Diligence or Review. For this post, I am assuming that the latest stable versions of PHP and cURL are installed and working. Below is some generic web scraping code that works well for most web sites.

5/20/2024

I generally grab all of the text on the page and then sort through it. Once I have worked out what I want to keep then I discard the raw data.

# $target is set to whatever web page URL I am looking to scrape. I can update it so that if the script finds a Next or page 2 link then it will cycle around again.

$target = "https://<web site url>";
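The "cycle around again" idea above can be sketched with a hypothetical helper that pulls the next-page link out of the fetched HTML. The rel="next" pattern is my assumption; real sites vary, so the regex usually needs adjusting per site:

```php
<?php
# Hypothetical helper: given the HTML of one page, return the URL of the
# next page, or null if there isn't one. Looking for rel="next" is an
# assumption; many sites need a different pattern.
function next_page_url(string $html): ?string
{
    if (preg_match('/<a\b[^>]*rel="next"[^>]*href="([^"]+)"/i', $html, $m)) {
        return $m[1];
    }
    return null;
}

# Quick check against a page with a next link and one without.
var_dump(next_page_url('<a rel="next" href="/page/2">Next</a>')); # "/page/2"
var_dump(next_page_url('<p>last page</p>'));                      # NULL
```

The main scraping loop would then call this after each fetch and stop when it returns null (with a sensible cap on iterations to avoid runaway loops).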

# cURL needs a User-Agent so that it can emulate the end user's browser. I often have to try several different User Agents before I find one that works reliably.

$user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36';

# If you are going to use Tor then you need to define a proxy and port. I tend not to use this.

$proxy = "127.0.0.1";

$port = "9050";

# get a cookie and set up the web scraping request

$ckfile = tempnam ("/home/pgroom", "targetwebpagecookie.txt");

$ch = curl_init($target);

curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

curl_setopt($ch, CURLOPT_URL, $target);

$timeout = 30; # connection timeout in seconds; 30 is just my usual starting point

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

curl_setopt($ch, CURLOPT_MAXREDIRS, 4);

curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);

curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);

curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

curl_setopt($ch, CURLOPT_ENCODING, "");

# The next line is for debugging only and spits out the headers so you can look at the handshakes.

# curl_setopt($ch, CURLOPT_VERBOSE, TRUE);

# The next two lines disable certificate checking for https web sites and are not secure in any way. If I am productionising something then I tighten these up.

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
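For completeness, this is a sketch of the tightened-up version I would move towards in production. Setting CURLOPT_SSL_VERIFYHOST to 2 checks that the hostname matches the certificate; the CA bundle path is an example only and varies by distro:

```php
<?php
# Production-leaning sketch: verify the certificate instead of skipping it.
$ch = curl_init("https://example.com"); # example URL
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);     # hostname must match the certificate
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);  # validate the certificate chain
# curl_setopt($ch, CURLOPT_CAINFO, "/etc/ssl/certs/ca-certificates.crt"); # example CA bundle path
curl_close($ch);
```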

# This section is only required for Tor, which is why it is commented out.

# curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME); # i.e. 7: SOCKS5 with DNS resolved via the proxy

# curl_setopt($ch, CURLOPT_PROXY, $proxy.':'.$port);

# This is where the actual web scraping request is made and errors (if any) are flagged.

$initpage = curl_exec($ch);

$curl_errno = curl_errno($ch);

$curl_error = curl_error($ch);

curl_close($ch);

# If there are any errors then they are displayed here.

if ($curl_errno > 0)

{

print "\nCurl error no: ".$curl_errno;

print "\nCurl error: ".$curl_error;

}
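The "grab all of the text on the page and then sort through it" step from the top can be sketched with PHP's built-in DOMDocument. Here a literal stands in for the HTML that curl_exec() returned into $initpage:

```php
<?php
# Sketch of flattening a fetched page to plain text with DOMDocument.
# In the real script $initpage holds the output of curl_exec().
$initpage = '<html><body><h1>Title</h1><p>Some body text.</p></body></html>';

$doc = new DOMDocument();
# The @ suppresses warnings about imperfect real-world markup.
@$doc->loadHTML($initpage);

# textContent flattens the whole document to text with the tags stripped.
$text = trim($doc->textContent);
echo $text;
```

From there it is ordinary string work (strpos, preg_match and friends) to pick out the bits worth keeping before discarding the raw data.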

And that is really all there is to it; I hope the above proves useful. I have also written posts on Docker, Ansible and Selenium & Python.

Thanks

Pete