Bright Data: Robust Web Scraping with Proxy Service
Welcome to this tutorial on using Bright Data's Proxy Network with Python. While Bright Data offers high-level tools like Scraper APIs and Datasets, this guide focuses on the foundational service: the proxy network itself.
We'll cover how to integrate Bright Data's powerful proxies with the popular `requests` library to perform common web scraping tasks reliably and efficiently. You'll learn how to overcome IP blocks, access content from different countries, and manage request sessions. Let's get started! 🌍
1. Why Use a Proxy for Web Scraping?
When you scrape a website, you send many requests from your computer's IP address. Websites can easily detect this unusual activity and may block your IP, show you misleading information (cloaking), or require you to solve CAPTCHAs.
A proxy server acts as an intermediary. It forwards your request to the target website using its own IP address, hiding yours. A proxy network, like Bright Data's, gives you access to a massive pool of different IPs (Datacenter, Residential, ISP, Mobile) around the globe. This allows you to:
- Avoid IP Bans and Rate Limits: By rotating through different IPs, your requests appear to come from many different users, making your scraper much harder to detect and block.
- Access Geo-Restricted Content: You can make your request appear as if it's coming from a specific country, allowing you to scrape localized pricing, content, or services.
- Improve Scalability and Reliability: A large, reliable proxy network ensures your scraper can run at scale with a high success rate.
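To preview what this looks like in code, here is a minimal sketch of how the `requests` library routes traffic through a proxy via a `proxies` dict. The credentials here are placeholders (Section 2 walks through obtaining real ones), and the IP echo endpoint is the same `api.ipify.org` service used later in this tutorial.
import requests
# Placeholder proxy URL; the real host, port, and credentials come from
# your Bright Data dashboard (see Section 2).
proxy_url = "http://YOUR_USERNAME:YOUR_PASSWORD@brd.superproxy.io:YOUR_PORT"
proxies = {"http": proxy_url, "https": proxy_url}
# With a rotating residential zone, each request may exit from a
# different IP in the pool.
for i in range(3):
    resp = requests.get("https://api.ipify.org?format=json",
                        proxies=proxies, timeout=10)
    print(f"Request {i + 1} exit IP: {resp.json()['ip']}")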
2. Setup and Configuration
2.1. Installing Libraries
First, we'll install the `requests` library to make HTTP requests and `python-dotenv` to manage our proxy credentials securely.
%pip install requests python-dotenv -q
2.2. Get Your Bright Data Proxy Credentials
Before we start, you need your proxy credentials from the Bright Data dashboard.
- Sign Up: Create an account at Bright Data.
- Navigate to Proxies & Scraping Infrastructure: In the dashboard, go to this section and click "Add" to create a new proxy zone.
- Choose a Network Type: For most web scraping tasks, Residential Proxies are the most effective. Select it and configure your zone.
- Get Credentials: Once the zone is created, click on it and go to the "Access parameters" tab. You will find your Host, Port, Username, and Password. The host will typically be `brd.superproxy.io`.
2.3. Configure Your `.env` File
Create a file named `.env` in the same directory as this notebook. Storing credentials here keeps them secure and out of your code. Add your credentials like this, replacing the placeholder values with your actual access parameters:
BRIGHTDATA_HOST='brd.superproxy.io'
BRIGHTDATA_PORT='your_port'
BRIGHTDATA_USERNAME='brd-customer-hl_xxxxxxxx-zone-your_zone_name'
BRIGHTDATA_PASSWORD='your_zone_password'
import os
import requests
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Retrieve credentials
host = os.getenv("BRIGHTDATA_HOST")
port = os.getenv("BRIGHTDATA_PORT")
username = os.getenv("BRIGHTDATA_USERNAME")
password = os.getenv("BRIGHTDATA_PASSWORD")

# Check if all credentials are loaded
if not all([host, port, username, password]):
    raise ValueError("Proxy credentials not found in .env file. Please check your configuration.")

# Construct the proxy URL for the requests library
proxy_url = f"http://{username}:{password}@{host}:{port}"
proxies = {
    "http": proxy_url,
    "https": proxy_url
}

print("✅ Proxy credentials loaded successfully!")
✅ Proxy credentials loaded successfully!
3. Making a Basic Proxied Request
Let's test our setup. We will make a request to `https://geo.brdtest.com/mygeo.json`, a Bright Data service that returns geo-location details about the IP address of the incoming request. First, we'll check our real IP, and then we'll make the same request through the proxy to see the IP change.
target_url = 'https://geo.brdtest.com/mygeo.json'

response_local = requests.get(target_url)
response_local.raise_for_status()  # Raise an exception for bad status codes
local_geo = response_local.json().get('geo')
local_geo
{'city': 'Singapore', 'region': '', 'region_name': '', 'postal_code': '31', 'latitude': 1.3352, 'longitude': 103.8529, 'tz': 'Asia/Singapore', 'lum_city': 'singapore'}
from IPython.display import display

target_url = 'https://geo.brdtest.com/mygeo.json'

# 1. Request with your local IP
try:
    print("Requesting without a proxy...")
    response_local = requests.get(target_url)
    response_local.raise_for_status()  # Raise an exception for bad status codes
    local_data = response_local.json().get('geo')
    display(f"🌍 Your Local Geo: {local_data}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

print("-" * 30)

# 2. Request with the Bright Data proxy
try:
    print("Requesting with a Bright Data proxy...")
    # The 'proxies' dict tells requests to route the call through the proxy.
    # `verify=False` acts like cURL's -k flag: it skips SSL certificate
    # verification for the proxied connection (expect an InsecureRequestWarning).
    response_proxy = requests.get(target_url, proxies=proxies, verify=False)
    response_proxy.raise_for_status()
    proxy_data = response_proxy.json().get('geo')
    display(f"🕵️ Your Proxy Geo: {proxy_data}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
Requesting without a proxy...
"🌍 Your Local Geo: {'city': 'Singapore', 'region': '', 'region_name': '', 'postal_code': '31', 'latitude': 1.3352, 'longitude': 103.8529, 'tz': 'Asia/Singapore', 'lum_city': 'singapore'}"
------------------------------
Requesting with a Bright Data proxy...
"🕵️ Your Proxy Geo: {'city': 'London', 'region': 'ENG', 'region_name': 'England', 'postal_code': 'EC4R', 'latitude': 51.5164, 'longitude': -0.093, 'tz': 'Europe/London', 'lum_city': 'london', 'lum_region': 'eng'}"
4. Advanced Usage: Geo-Targeting
One of the most powerful features of a proxy network is geo-targeting. You can make requests appear from virtually any country. With Bright Data, this is done by adding a country parameter to your proxy username.
The format is `username-country-COUNTRYCODE`. For example, to use a German IP, your username would become `your_username-country-de`.
def get_proxied_ip(country_code=None):
    """Fetches geo details through a potentially geo-targeted proxy."""
    proxy_user_geo = username
    if country_code:
        proxy_user_geo += f"-country-{country_code.lower()}"
    geo_proxy_url = f"http://{proxy_user_geo}:{password}@{host}:{port}"
    geo_proxies = {
        'http': geo_proxy_url,
        'https': geo_proxy_url
    }
    try:
        print(f"Requesting with a proxy from country: {country_code or 'Any (Rotating)'}...")
        response = requests.get(target_url, proxies=geo_proxies, timeout=10, verify=False)
        response.raise_for_status()
        data = response.json()
        display(f" -> 🕵️ Proxy Data: {data.get('geo')}")
    except requests.exceptions.RequestException as e:
        print(f" -> An error occurred: {e}\n")

# Example: Request from Germany (DE) and Canada (CA)
get_proxied_ip(country_code="de")
get_proxied_ip(country_code="ca")
Requesting with a proxy from country: de...
" -> 🕵️ Proxy Data: {'city': 'Düsseldorf', 'region': 'NW', 'region_name': 'North Rhine-Westphalia', 'postal_code': '40468', 'latitude': 51.2562, 'longitude': 6.7827, 'tz': 'Europe/Berlin', 'lum_city': 'dusseldorf', 'lum_region': 'nw'}"
Requesting with a proxy from country: ca...
" -> 🕵️ Proxy Data: {'city': 'Mississauga', 'region': 'ON', 'region_name': 'Ontario', 'postal_code': 'L5A', 'latitude': 43.5873, 'longitude': -79.614, 'tz': 'America/Toronto', 'lum_city': 'mississauga', 'lum_region': 'on'}"
5. Advanced Usage: Sticky Sessions
By default, each request you send through a residential proxy zone might use a different IP. This is great for avoiding blocks. However, sometimes you need to maintain the same IP across multiple requests, for example, when navigating a multi-page form or a shopping cart.
This is called a "sticky session." To use one, you add a session ID parameter to your username: `username-session-SESSIONID`. The `SESSIONID` can be any random string or number you choose. All requests using the same session ID will be routed through the same IP address.
import random
import requests

# Simplified sticky session test using a single IP echo endpoint.
IP_ENDPOINT = "https://api.ipify.org?format=json"

def test_session_ip(session_id: int, attempts: int = 2, timeout: int = 10):
    """Check whether the same proxy IP is kept across multiple requests using a session ID.

    Args:
        session_id: Arbitrary integer/str to pin the session.
        attempts: How many requests to make (default 2 for a simple comparison).
        timeout: Seconds before timing out each request.
    """
    session_username = f"{username}-session-{session_id}"
    session_proxy_url = f"http://{session_username}:{password}@{host}:{port}"
    session_proxies = {
        'http': session_proxy_url,
        'https': session_proxy_url
    }
    display(f"--- Testing Sticky Session (Session ID: {session_id}) ---")
    ips = []
    for i in range(attempts):
        try:
            display(f" Request #{i+1} -> querying {IP_ENDPOINT}")
            resp = requests.get(IP_ENDPOINT, proxies=session_proxies, timeout=timeout, verify=False)
            resp.raise_for_status()
            ip = resp.json().get('ip')
            display(f" Returned IP: {ip}")
            ips.append(ip)
        except Exception as e:
            display(f" Error: {e}")
            ips.append(None)
    # Simple evaluation
    if len(ips) >= 2 and all(ips) and len(set(ips)) == 1:
        display("✅ Sticky success: All requests used the same IP.")
    else:
        display("❌ Not sticky (or undetermined). IPs observed:")
        for idx, ip in enumerate(ips, start=1):
            display(f" Attempt {idx}: {ip}")
        display(" (Different or missing IPs can mean rotation is enforced or the request failed.)")

# Run the simplified test
random_session_id = random.randint(100000, 999999)
test_session_ip(random_session_id)
'--- Testing Sticky Session (Session ID: 648044) ---'
' Request #1 -> querying https://api.ipify.org?format=json'
' Returned IP: 45.185.133.250'
' Request #2 -> querying https://api.ipify.org?format=json'
' Returned IP: 45.185.133.250'
'✅ Sticky success: All requests used the same IP.'
6. Best Practices and Conclusion
You have successfully configured and used the Bright Data Proxy Network with Python! You've learned to mask your IP, target specific countries, and maintain sessions.
To make your scrapers even more robust, always remember to:
- Set Realistic Headers: Besides changing your IP, you should also set a `User-Agent` header to mimic a real web browser. This is a crucial step to avoid being identified as a bot (see the sketch after this list).
- Implement Error Handling: Network requests can fail. Always wrap your requests in `try...except` blocks to handle potential timeouts, connection errors, or bad HTTP status codes gracefully.
- Respect `robots.txt`: Be a good internet citizen. Check a website's `robots.txt` file (e.g., `example.com/robots.txt`) for rules about which parts of the site should not be accessed by automated programs.
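To tie these together, here is a minimal sketch of a proxied request that sets a browser-like `User-Agent`, consults `robots.txt` first, and wraps the call in error handling. It assumes the `proxies` dict from Section 2 is defined; the helper name, target URL, and header string are illustrative choices, not Bright Data requirements.
from urllib.parse import urlparse
from urllib import robotparser
import requests

def can_scrape(url: str, user_agent: str) -> bool:
    """Best-effort robots.txt check before fetching."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True  # robots.txt unreachable; proceed with caution
    return rp.can_fetch(user_agent, url)

# An illustrative browser-like User-Agent string (any realistic value works).
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
url = "https://example.com/"

if can_scrape(url, HEADERS["User-Agent"]):
    try:
        # Route through the Bright Data proxy configured in Section 2.
        # verify=False mirrors the earlier cells; drop it if your zone's
        # SSL certificate is installed locally.
        resp = requests.get(url, headers=HEADERS, proxies=proxies,
                            timeout=10, verify=False)
        resp.raise_for_status()
        print(f"Fetched {len(resp.text)} bytes from {url}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
else:
    print("robots.txt disallows fetching this URL.")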
This tutorial provides a solid foundation for building powerful and resilient web scrapers. To explore more advanced features or different proxy types, check out the official Bright Data documentation.
Happy scraping! 🎉