Using the PHP cURL extension
Sometimes you need to fetch something from another server in PHP. It's tempting to just use file_get_contents('http://example.com/'), but then you have very little control over what happens if that server is down or redirecting. The PHP cURL extension gives you access to a powerful library for making HTTP requests and handling the output. Here's an example of how it works. There are far too many comments, and I may have done something wrong. If so, please don't kill me! Use my contact form and let me know.
Fill in the blanks
If you use this code yourself, be a responsible coder and change the CURLOPT_USERAGENT string! Make it yours.
Read the known issues and notes after the example, too!
<?php
function httpGet($url, $ttl = 86400)
{
    /* Change this or make it an option as appropriate. If you're
     * getting URLs that shouldn't be visible to the public, put the
     * cache folder somewhere it can't be accessed from the web.
     * The folder has to exist and be writeable by the server.
     */
    $cache_path = dirname(__FILE__).'/cache';

    /* Check the cache first - passing a $ttl of 0 skips the check
     * and forces a refresh. I'm using crc32() to make URLs safe
     * here; if you're fetching millions of URLs, it might not be
     * different enough to avoid clashes. If you get collisions, use
     * md5() or something, and change the sprintf() pattern.
     */
    $cache_file = sprintf('%s/%08X.dat', $cache_path, crc32($url));
    $cache_exists = is_readable($cache_file);

    /* Remember when the cache was last written. We need this for the
     * conditional request further down, and it has to be read before
     * the touch() below changes it.
     */
    $cache_mtime = $cache_exists ? filemtime($cache_file) : 0;

    /* If the cache is newer than the Time To Live, return it
     * instead of doing a new request. The default TTL is 1 day.
     */
    if ($ttl && $cache_exists && ($cache_mtime > (time() - $ttl))) {
        return file_get_contents($cache_file);
    }

    /* Need to regenerate the cache. First thing to do here is update
     * the modification time on the cache file so that no one else
     * tries to update the cache while we're updating it. Only do
     * this if there's already a cache: touch() on a missing file
     * would create an empty one, and that would get served as a
     * valid cache hit if the request failed.
     */
    if ($cache_exists) {
        touch($cache_file);
        clearstatcache();
    }

    /* Set up the cURL handle. It's important to set a User-Agent
     * that's unique to you, and provides contact details in case your
     * script is misbehaving and a server owner needs to contact you.
     * More than that, it's just the polite thing to do.
     */
    $c = curl_init();
    curl_setopt($c, CURLOPT_URL, $url);
    curl_setopt($c, CURLOPT_TIMEOUT, 15);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_USERAGENT,
        'ExampleFetcher/0.9 (http://example.com/; bob@example.com)');

    /* If we've got a cache, do the web a favour and make a
     * conditional HTTP request. What this means is that if the
     * server supports it, it will tell us if nothing has changed -
     * this means we can reuse the cache for a while, and the
     * request is returned faster. Note that this uses the original
     * modification time, not the one we just set with touch().
     */
    if ($cache_exists) {
        curl_setopt($c, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
        curl_setopt($c, CURLOPT_TIMEVALUE, $cache_mtime);
    }

    /* Make the request and check the result. */
    $content = curl_exec($c);
    $status = curl_getinfo($c, CURLINFO_HTTP_CODE);

    // Document unmodified? Return the cache file
    if ($cache_exists && ($status == 304)) {
        return file_get_contents($cache_file);
    }

    /* You could be more forgiving of errors here. I've chosen to
     * fail hard instead, because at least it'll be obvious when
     * something goes wrong. A failed connection or a timeout shows
     * up here as curl_exec() returning false with a status of 0.
     */
    if ($content === false || $status != 200) {
        throw new Exception(sprintf('Unexpected HTTP return code %d', $status));
    }

    /* If everything is fine, save the new cache file, make sure
     * it's world-readable, and writeable by the server
     */
    file_put_contents($cache_file, $content);
    chmod($cache_file, 0644);

    return $content;
}
?>
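The comment in the example suggests swapping crc32() for md5() if you start seeing collisions. Here's a minimal sketch of what that might look like. cacheFileFor() is a hypothetical helper name, not part of the code above, and the '%s/%s.dat' pattern is just one reasonable choice.

```php
<?php
/* Hypothetical helper: build a cache filename from an md5() hash of
 * the URL instead of crc32(). md5() returns 32 lowercase hex
 * characters, so the sprintf() pattern uses %s rather than %08X.
 */
function cacheFileFor($cache_path, $url)
{
    // rtrim() avoids a double slash if the path already ends in one
    return sprintf('%s/%s.dat', rtrim($cache_path, '/'), md5($url));
}
```

Inside httpGet() you would then replace the sprintf() line with $cache_file = cacheFileFor($cache_path, $url);. Old crc32-named cache files won't be found any more, so expect one round of fresh requests after the switch.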
Notes and known issues
- The most common reason that this code doesn't work is that your PHP installation doesn't have the cURL extension installed and enabled. If you're running a Linux distribution like Debian, CentOS, or Ubuntu, there are simple commands to fetch and install the cURL extension (like apt-get install php5-curl, or yum install php5-curl; the exact package name depends on your distribution and PHP version), but I can't help you with this. I'm not your sysadmin, sorry!
- I fail on redirection here because if you're hitting a web service like Last.FM, redirection usually means you're doing something wrong. If you're not hitting a web service, then you might want to be more flexible. Look up the CURLOPT_FOLLOWLOCATION setting in the PHP manual.
- I use CURLOPT_RETURNTRANSFER here as a shortcut to retrieve the whole document to a string. This works fine if you're fetching something you know the size of, like a Last.FM top album chart. If you're fetching random web documents or images, you can very easily retrieve something that breaks PHP's memory_limit setting. If you're doing that kind of thing, turn off CURLOPT_RETURNTRANSFER, and open a file-handle instead, passing it to CURLOPT_FILE so that curl_exec() saves the content to the file instead of holding it in memory. PHP's not the right thing to use for getting large files, really, but if you have to, remember that PHP's script execution limit is usually 30 seconds. That timer is suspended while you download, and will instantly kick in when curl_exec() returns, which means your script might download a massive file and then die without doing anything or cleaning up after itself. Hardly ideal! If you're worried that this might happen, you can use the cURL extension to make a HEAD request instead of a GET, and check the file-size before downloading.
- I'm assuming that you're making web-service requests or grabbing RSS files with this, not fetching URLs supplied by user input. Needless to say, if it's the latter, check the hell out of user input before using it. You're inviting all kinds of mischief if you allow people to specify which URLs your server fetches.
- If you're fetching XML, you might want to check that the result actually parses before you overwrite the cache with broken data. Do something like $xml = simplexml_load_string($content); before the file_put_contents() at the end, and check that $xml isn't false.
- This code is released without warranty or guarantee of any kind. It might not work, it might delete the internet. On balance I think it's fine, but if you choose to use it, you do so at your own risk. On the up side, if you want to use it, you can! In legalese:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
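For the XML note above, here's a rough sketch of a check you could run before the file_put_contents() call. isParseableXml() is a name I've made up for illustration, and it assumes the SimpleXML extension is available, which it is in most PHP builds.

```php
<?php
/* Hypothetical helper: returns true only if $content parses as XML.
 * libxml_use_internal_errors() keeps parse errors out of the output,
 * so a broken feed doesn't spray warnings everywhere.
 */
function isParseableXml($content)
{
    // curl_exec() can return false, and an empty body is never valid XML
    if (!is_string($content) || trim($content) === '') {
        return false;
    }

    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Strict comparison: an empty element is a valid result but casts to false
    return $xml !== false;
}
```

In httpGet() you might throw an exception when isParseableXml($content) returns false, which leaves the old cache file in place instead of overwriting it with junk.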
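If you really do have to fetch URLs that come from user input, one possible first line of defence is to accept only http and https URLs on hosts you've explicitly listed. This is a sketch under that assumption, not a complete answer: isAllowedUrl() and the $allowed_hosts list are hypothetical, and the check says nothing about where a redirect might lead.

```php
<?php
/* Hypothetical helper: accept only http(s) URLs whose host is on a
 * list you control. Anything that doesn't parse cleanly is rejected.
 */
function isAllowedUrl($url, array $allowed_hosts)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    // Rules out file://, ftp://, gopher:// and friends
    $scheme = strtolower($parts['scheme']);
    if (!in_array($scheme, array('http', 'https'), true)) {
        return false;
    }

    // Exact host match only; subdomains have to be listed separately
    return in_array(strtolower($parts['host']), $allowed_hosts, true);
}
```

You would call it before httpGet(), for example if (isAllowedUrl($url, array('example.com'))) { $data = httpGet($url); }. Because httpGet() doesn't follow redirects, a listed host can't quietly bounce the request somewhere else.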