BrightData: Structured Web Data Extraction with LangChain
Welcome to this advanced tutorial on using BrightData's Scraper APIs within the LangChain framework.
Instead of just scraping raw HTML, we'll learn how to extract clean, structured JSON data directly from complex websites like LinkedIn. We'll then use a Large Language Model (LLM) to interpret and act upon this structured data.
1. Core Concepts: Beyond Raw Scraping
BrightData Scraper APIs are powerful, pre-built collectors designed to pull specific types of data from major websites. For example, instead of writing a scraper for a LinkedIn profile, you simply tell the API, "get me the profile data from this URL," and it returns a clean JSON object.
The langchain-brightdata integration packages this functionality into a LangChain Tool. A Tool is a component that an LLM can use to interact with the outside world. In our case, the BrightDataWebScraperAPI tool allows an AI agent to look up structured data from the web.
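To make the idea of a Tool concrete, here is a toy sketch in plain Python. It is not LangChain's actual Tool class: `ToyTool`, `fake_profile_lookup`, and the `tools` registry are hypothetical names invented for illustration, and the lookup returns placeholder data instead of calling any API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyTool:
    """Hypothetical stand-in for a tool: a name, a description, and a callable."""

    name: str
    description: str
    func: Callable[[dict], dict]

    def invoke(self, tool_input: dict) -> dict:
        # Pass the structured input straight to the wrapped function.
        return self.func(tool_input)


def fake_profile_lookup(tool_input: dict) -> dict:
    # Placeholder only: a real tool would call a remote API here.
    return {"input_url": tool_input["url"], "name": "Example Person"}


# A registry keyed by tool name, loosely mimicking how an agent picks a tool.
tools: Dict[str, ToyTool] = {
    "profile_lookup": ToyTool(
        name="profile_lookup",
        description="Return structured profile data for a URL.",
        func=fake_profile_lookup,
    )
}

result = tools["profile_lookup"].invoke({"url": "https://example.com/in/someone"})
print(result)
```

The description is what lets an LLM decide when a tool is relevant, and the dictionary input mirrors the way the scraper tool is invoked later in this tutorial.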
2. Setup
First, let's install the necessary libraries. We'll need langchain, the BrightData and OpenAI integrations, and python-dotenv for our API keys.
# Uncomment to install the required packages
# !pip install langchain langchain-brightdata langchain-openai python-dotenv
Next, create a file named .env in the same directory as this notebook. This file keeps your API keys out of the notebook itself, so remember to exclude it from version control.
Your .env file should look like this:
BRIGHT_DATA_API_KEY=your_brightdata_api_key
OPENAI_API_KEY=your_openai_api_key
Note: The tool automatically looks for an environment variable named BRIGHT_DATA_API_KEY, so keep that exact name in the .env file.
You can find your BRIGHT_DATA_API_KEY in the Bright Data dashboard.
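A missing key usually surfaces later as a confusing authentication error, so it can help to check for both variables up front. The following is a minimal sketch using only the standard library. `REQUIRED_KEYS` and `missing_keys` are hypothetical helper names, not part of any library mentioned here.

```python
import os
from typing import List, Mapping, Optional

# The two variable names used in the .env file above.
REQUIRED_KEYS = ("BRIGHT_DATA_API_KEY", "OPENAI_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Demonstration with an explicit mapping instead of the real environment.
example_env = {"BRIGHT_DATA_API_KEY": "your_brightdata_api_key"}
print(missing_keys(example_env))
```

In the notebook, you could call `missing_keys()` with no arguments right after `load_dotenv()` and stop early if the returned list is not empty.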
3. Initializing and Using the Scraper Tool
The code below loads our API keys and initializes the BrightDataWebScraperAPI as a tool. We then invoke it to get structured data from a LinkedIn profile. The dataset_type parameter is crucial: it tells BrightData what kind of data to extract.
import os
import json
from dotenv import load_dotenv
from langchain_brightdata import BrightDataWebScraperAPI
# Load API keys from .env file
load_dotenv()
# The tool automatically finds the BRIGHT_DATA_API_KEY in your environment variables
scraper_tool = BrightDataWebScraperAPI()
# Invoke the tool to get structured data from a LinkedIn profile
print("Extracting data from LinkedIn profile...")
linkedin_data = scraper_tool.invoke(
    {"url": "https://www.linkedin.com/in/williamhgates/", "dataset_type": "linkedin_person_profile"}
)
# Remove unwanted fields from the result before storing
fields_to_remove = ['people_also_viewed', 'activity', 'similar_profiles']
if isinstance(linkedin_data, list) and len(linkedin_data) > 0 and isinstance(linkedin_data[0], dict):
    data_to_print = dict(linkedin_data[0])
elif isinstance(linkedin_data, dict):
    data_to_print = dict(linkedin_data)
else:
    data_to_print = {}
for field in fields_to_remove:
    data_to_print.pop(field, None)
# Store the cleaned data for later use
profile_data = data_to_print
print("Data extraction completed successfully!")
Extracting data from LinkedIn profile...
Data extraction completed successfully!
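Note that the cell above prints a success message even when the result has an unexpected shape and the cleaned dictionary ends up empty. One way to make that case visible is to wrap the same normalization logic in a function that raises instead. This is a sketch only, and `clean_profile` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable


def clean_profile(raw: Any, fields_to_remove: Iterable[str]) -> Dict[str, Any]:
    """Normalize a scraper result to one dict and drop unwanted fields."""
    if isinstance(raw, list) and raw and isinstance(raw[0], dict):
        record = dict(raw[0])  # copy so the original result is untouched
    elif isinstance(raw, dict):
        record = dict(raw)
    else:
        raise ValueError(f"Unexpected scraper result of type {type(raw).__name__}")
    for field in fields_to_remove:
        record.pop(field, None)
    return record


# Made-up sample shaped like a one-element list result.
sample = [{"name": "Bill Gates", "activity": ["post"], "similar_profiles": []}]
print(clean_profile(sample, ["people_also_viewed", "activity", "similar_profiles"]))
```

With this helper, the notebook cell could reduce to a single call such as `profile_data = clean_profile(linkedin_data, fields_to_remove)`.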
print("Profile Data:")
print(json.dumps(profile_data, indent=2))
Profile Data: { "id": "williamhgates", "name": "Bill Gates", "city": "Seattle, Washington, United States", "country_code": "US", "position": "Chair, Gates Foundation and Founder, Breakthrough Energy", "about": "Chair of the Gates Foundation. Founder of Breakthrough Energy. Co-founder of Microsoft. Voracious reader. Avid traveler. Active blogger.", "current_company": { "name": "Gates Foundation", "company_id": "gates-foundation", "title": "Co-chair", "location": null }, "experience": [ { "title": "Co-chair", "description_html": null, "start_date": "2000", "end_date": "Present", "company": "Gates Foundation", "company_id": "gates-foundation", "url": "https://www.linkedin.com/company/gates-foundation", "company_logo_url": "https://media.licdn.com/dms/image/v2/D560BAQEgMqqFTd40Tg/company-logo_100_100/company-logo_100_100/0/1736784969376/bill__melinda_gates_foundation_logo?e=2147483647&v=beta&t=2JH2cMcZms60vPAMbvVZyMeYXosQ1Jjy5axDlyeQ1Ww" }, { "title": "Founder", "description_html": null, "start_date": "2015", "end_date": "Present", "company": "Breakthrough Energy", "company_id": "breakthrough-energy", "url": "https://www.linkedin.com/company/breakthrough-energy", "company_logo_url": "https://media.licdn.com/dms/image/v2/C4D0BAQGwD9vNu044FA/company-logo_100_100/company-logo_100_100/0/1630531940051/breakthrough_energy_ventures_logo?e=2147483647&v=beta&t=nL8eeluwraYnfTTnHApCodLZnaRGV8WtyNeFI_XhJ-M" }, { "title": "Co-founder", "description_html": null, "start_date": "1975", "end_date": "Present", "company": "Microsoft", "company_id": "microsoft", "url": "https://www.linkedin.com/company/microsoft", "company_logo_url": "https://media.licdn.com/dms/image/v2/D560BAQH32RJQCl3dDQ/company-logo_100_100/B56ZYQ0mrGGoAU-/0/1744038948046/microsoft_logo?e=2147483647&v=beta&t=rr_7_bFRKp6umQxIHErPOZHtR8dMPIYeTjlKFdotJBY" } ], "url": "https://www.linkedin.com/in/williamhgates/", "educations_details": "Harvard University", "education": [ { "title": "Harvard University", "url": 
"https://www.linkedin.com/school/harvard-university/?trk=public_profile_school_profile-section-card_image-click", "start_year": "1973", "end_year": "1975", "description": null, "description_html": null, "institute_logo_url": "https://media.licdn.com/dms/image/v2/C4E0BAQF5t62bcL0e9g/company-logo_100_100/company-logo_100_100/0/1631318058235?e=2147483647&v=beta&t=Ye1klXowyo8TIcnkhTlmORgiA5ZywvooNihDMnx5urQ" }, { "title": "Lakeside School", "url": "https://www.linkedin.com/school/lakeside-school/?trk=public_profile_school_profile-section-card_image-click", "description": null, "description_html": null, "institute_logo_url": "https://media.licdn.com/dms/image/v2/D560BAQGFmOQmzpxg9A/company-logo_100_100/company-logo_100_100/0/1683732883164/lakeside_school_logo?e=2147483647&v=beta&t=EmadOLH7MckKZvCCrgmAOikCRtzVRtqqN4PJi35CNyo" } ], "avatar": "https://media.licdn.com/dms/image/v2/D5603AQF-RYZP55jmXA/profile-displayphoto-shrink_200_200/B56ZRi8g.aGsAY-/0/1736826818802?e=2147483647&v=beta&t=bKWfN6UwwtiCqFWsG7rBELbd48qJOAMLdxhBzzkJV0k", "followers": 38381729, "connections": 8, "current_company_company_id": "gates-foundation", "current_company_name": "Gates Foundation", "location": "Seattle", "input_url": "https://www.linkedin.com/in/williamhgates/", "linkedin_id": "williamhgates", "linkedin_num_id": "251749025", "banner_image": "https://media.licdn.com/dms/image/v2/D5616AQEjhPbTCeblYg/profile-displaybackgroundimage-shrink_200_800/B56ZcytR5SGsAc-/0/1748902420393?e=2147483647&v=beta&t=a-tBeZkxzWTHWYY6MAjxt0oTEuxlW33EUkK3gm5_te4", "honors_and_awards": null, "default_avatar": false, "memorialized_account": false, "bio_links": [ { "title": "Blog", "link": "https://gatesnot.es/sourcecode-li" } ], "first_name": "Bill", "last_name": "Gates", "timestamp": "2025-08-06T04:01:50.054Z", "input": { "url": "https://www.linkedin.com/in/williamhgates/" } }
4. Processing Structured Data with an LLM
Now for the fun part! We have clean, structured data. We don't need to parse HTML or clean up text. We can feed this data directly to an LLM for intelligent processing.
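The profile shown above still carries null fields and long image URLs that add tokens without helping a biography. As an optional step, the sketch below prunes them recursively before the data reaches the prompt. The field names in `IMAGE_FIELDS` are taken from the output above; the `compact` helper itself is a hypothetical addition, and the sample uses placeholder URLs.

```python
from typing import Any

# Image-URL fields seen in the profile output above.
IMAGE_FIELDS = {"company_logo_url", "institute_logo_url", "avatar", "banner_image"}


def compact(value: Any) -> Any:
    """Recursively drop None values and image-URL fields from nested data."""
    if isinstance(value, dict):
        return {
            key: compact(item)
            for key, item in value.items()
            if item is not None and key not in IMAGE_FIELDS
        }
    if isinstance(value, list):
        return [compact(item) for item in value]
    return value


sample_profile = {
    "name": "Bill Gates",
    "avatar": "https://example.com/avatar.png",
    "honors_and_awards": None,
    "experience": [
        {
            "title": "Co-chair",
            "description_html": None,
            "company": "Gates Foundation",
            "company_logo_url": "https://example.com/logo.png",
        }
    ],
}
print(compact(sample_profile))
```

Whether this pruning is worthwhile depends on the model's context window and pricing; for a single profile it is mostly a readability gain.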
Let's ask an LLM to write a short, professional biography based on the structured data we just extracted.
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from IPython.display import display, Markdown
# 1. Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
# 2. Create a Prompt Template that accepts our structured data
prompt_template = PromptTemplate(
    input_variables=["profile_json"],
    template="""
Based on the following JSON data from a LinkedIn profile, please write a concise, one-paragraph professional biography for this person within 50 words.
JSON Data:
{profile_json}
Professional Biography:
"""
)
# 3. Create an LLM Chain
bio_chain = LLMChain(llm=llm, prompt=prompt_template)
# 4. Run the chain with our extracted data
print("Generating biography with LLM...")
professional_bio = bio_chain.run(profile_json=json.dumps(profile_data, indent=2))
display(Markdown(professional_bio))
Generating biography with LLM...
Bill Gates, Chair of the Gates Foundation and Founder of Breakthrough Energy, is a renowned philanthropist and technology pioneer. As the co-founder of Microsoft, he has revolutionized the digital landscape. With a passion for innovation and social impact, Gates continues to drive positive change worldwide.
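The prompt asks for a biography within 50 words, but an LLM does not always respect length limits. A small check like the one below can flag responses that run long. It is a sketch using whitespace-separated tokens as an approximation of words, and `word_count` and `within_limit` are hypothetical helper names.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens as an approximation of words."""
    return len(text.split())


def within_limit(text: str, limit: int = 50) -> bool:
    """Return True when the text has at most `limit` words."""
    return word_count(text) <= limit


# The biography generated above, copied verbatim.
bio = (
    "Bill Gates, Chair of the Gates Foundation and Founder of Breakthrough Energy, "
    "is a renowned philanthropist and technology pioneer. As the co-founder of "
    "Microsoft, he has revolutionized the digital landscape. With a passion for "
    "innovation and social impact, Gates continues to drive positive change worldwide."
)
print(word_count(bio), within_limit(bio))
```

If the check fails, a simple option is to re-run the chain, possibly with a lower temperature or a more explicit instruction.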
5. Conclusion
In this notebook, we leveled up our data collection strategy. By using BrightData's Scraper APIs through LangChain, we completely bypassed the messy step of parsing raw HTML. We jumped straight from a URL to clean, structured JSON data, which we then used to power an intelligent LLM task.
This workflow is a good fit for tasks such as:
- Recruitment: Analyzing candidate profiles.
- Market Research: Aggregating product data from e-commerce sites.
- Lead Generation: Collecting information about companies and their key personnel.
To learn more, see the Bright Data AI offerings page.