The SEC EDGAR API Is the Best Free Dataset in Finance

Every public company in the United States is required to file electronically with the SEC. All of that data — 10-K annual reports, 13F institutional holdings, Form 4 insider transactions, 8-K material events — is freely accessible via the EDGAR full-text search API. Most quant PMs pay thousands per month for data they could pull for free.

This tutorial shows you exactly how to extract structured financial data from SEC EDGAR using Python — no API key required, no Bloomberg terminal, no third-party data vendor.

Get this data automatically: Quantscope runs daily EDGAR extraction across your watchlist and delivers factor signals to your inbox. Start free →

EDGAR API Basics: Three Endpoints You Need to Know

The EDGAR system exposes several machine-readable endpoints. The three most useful for quant research are:

- Company submissions: https://data.sec.gov/submissions/CIK{cik}.json lists every filing a company has made, with form types, filing dates, and accession numbers.
- XBRL company facts: https://data.sec.gov/api/xbrl/companyfacts/CIK{cik}.json returns every structured financial concept a company has reported.
- Full-text search: https://efts.sec.gov/LATEST/search-index?q={query} searches the text of filings.

All three return JSON. The SEC allows at most 10 requests per second and requires a descriptive User-Agent header on every request: always include one, or your requests will be blocked.


Step 1: Set Up Your Python Environment

You need three libraries: requests for HTTP calls, pandas for data handling, and optionally beautifulsoup4 for parsing raw filing HTML.

pip install requests pandas beautifulsoup4

Always set a descriptive User-Agent to avoid rate limiting:

HEADERS = {
    "User-Agent": "YourFundName research@yourfund.com",
    "Accept-Encoding": "gzip, deflate",
}

Step 2: Pull 13F Institutional Holdings

13F filings disclose the long positions of institutional managers with $100M+ in 13(f) securities (mostly US-listed equities and equity options). Filed quarterly, within 45 days of quarter end, they are one of the most actionable signals for understanding institutional positioning.

import requests
import pandas as pd

BASE = "https://data.sec.gov"
HEADERS = {"User-Agent": "QuantResearch research@example.com"}

def get_cik(ticker: str) -> str | None:
    """Map a ticker symbol to its zero-padded SEC CIK number."""
    # EDGAR publishes a ticker-to-CIK mapping as a single JSON file
    tickers_url = "https://www.sec.gov/files/company_tickers.json"
    resp = requests.get(tickers_url, headers=HEADERS)
    resp.raise_for_status()
    for item in resp.json().values():
        if item["ticker"].upper() == ticker.upper():
            return str(item["cik_str"]).zfill(10)
    return None

def get_submissions(cik: str) -> dict:
    """Fetch all recent filings for a given CIK."""
    url = f"{BASE}/submissions/CIK{cik}.json"
    return requests.get(url, headers=HEADERS).json()

def find_13f_filings(cik: str, limit: int = 4) -> list:
    """Return the most recent 13F-HR filing accession numbers."""
    data = get_submissions(cik)
    recent = data.get("filings", {}).get("recent", {})
    forms = recent.get("form", [])
    accessions = recent.get("accessionNumber", [])
    dates = recent.get("filingDate", [])
    results = []
    for form, acc, date in zip(forms, accessions, dates):
        # Match 13F-HR holdings reports (and /A amendments), not 13F-NT notices
        if form.startswith("13F-HR"):
            results.append({"form": form, "accession": acc, "date": date})
        if len(results) == limit:
            break
    return results
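The helpers above return accession numbers, not documents. To fetch a filing itself you need its archive URL. A minimal sketch of the EDGAR archive path convention (the CIK is unpadded and the accession number drops its dashes in the directory name):

```python
def filing_index_url(cik: str, accession: str) -> str:
    """Build the EDGAR archive index URL for a filing.

    Archives live under /Archives/edgar/data/{cik}/{accession-no-dashes}/,
    with an index page named {accession}-index.htm.
    """
    acc_nodash = accession.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{acc_nodash}/{accession}-index.htm"
    )
```

The index page links to the filing's individual documents, including the 13F information table.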

Step 3: Extract XBRL Financial Data (10-K / 10-Q)

The XBRL financial data endpoint is the fastest way to pull structured fundamentals. It returns every reported financial concept for a company (revenue, net income, earnings per share, total assets) going back to when the company began tagging filings in XBRL, which the SEC phased in starting in 2009.

def get_xbrl_facts(cik: str) -> dict:
    """Pull all reported XBRL financial facts for a company."""
    url = f"{BASE}/api/xbrl/companyfacts/CIK{cik}.json"
    resp = requests.get(url, headers=HEADERS)
    return resp.json()

def extract_revenue(cik: str) -> pd.DataFrame:
    """Return an annual revenue time series."""
    facts = get_xbrl_facts(cik)
    us_gaap = facts.get("facts", {}).get("us-gaap", {})

    # Revenue can live under several XBRL tags depending on filing era
    for tag in ["Revenues", "RevenueFromContractWithCustomerExcludingAssessedTax", "SalesRevenueNet"]:
        if tag in us_gaap:
            units = us_gaap[tag].get("units", {}).get("USD", [])
            df = pd.DataFrame(units)
            # Keep annual values from 10-K filings; the same fiscal year is
            # re-reported in later filings, so dedupe on the period end date
            df = df[df["form"] == "10-K"].copy()
            df["end"] = pd.to_datetime(df["end"])
            df = df.drop_duplicates(subset="end", keep="last")
            return (
                df[["end", "val"]]
                .rename(columns={"val": "revenue_usd"})
                .sort_values("end")
                .reset_index(drop=True)
            )

    return pd.DataFrame()

# Example: Apple Inc. CIK
apple_cik = "0000320193"
revenue = extract_revenue(apple_cik)
print(revenue.tail(5))
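With the revenue series in hand, a simple derived factor is year-over-year growth. A sketch assuming the revenue_usd column produced by extract_revenue, one row per fiscal year:

```python
import pandas as pd

def revenue_growth(df: pd.DataFrame) -> pd.DataFrame:
    """Append a year-over-year growth column to an annual revenue series."""
    out = df.sort_values("end").copy()
    # Each row's growth is relative to the prior fiscal year
    out["yoy_growth"] = out["revenue_usd"].pct_change()
    return out
```

The first row's growth is NaN by construction, since there is no prior year to compare against.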

Step 4: Track Form 4 Insider Transactions

Form 4 must be filed within two business days of an insider transaction. This timeliness makes it one of the highest-frequency signals you can monitor. Academic studies suggest insider buying predicts annual excess returns in the range of 3-7% when filtered for high-conviction signals (cluster buying across multiple insiders, purchases above $500K).

from datetime import datetime, timedelta

def get_form4_filings(cik: str, days_back: int = 90) -> list:
    """Return recent Form 4 filings for a company."""
    cutoff = (datetime.now() - timedelta(days=days_back)).strftime("%Y-%m-%d")

    data = get_submissions(cik)
    recent = data.get("filings", {}).get("recent", {})
    forms = recent.get("form", [])
    accessions = recent.get("accessionNumber", [])
    dates = recent.get("filingDate", [])

    results = []
    for form, acc, date in zip(forms, accessions, dates):
        if form == "4" and date >= cutoff:
            results.append({"form": form, "accession": acc, "date": date})

    return results
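A crude first pass at the cluster-buying idea using only the filing metadata above: count how many Form 4s landed within a short window of the newest one. This is only a proxy (real cluster detection needs the parsed transaction XML for buyer identity, buy vs. sell direction, and dollar size), but it is a cheap screen:

```python
from datetime import datetime, timedelta

def recent_form4_count(filings: list, window_days: int = 5) -> int:
    """Count Form 4 filings dated within window_days of the newest filing."""
    if not filings:
        return 0
    dates = [datetime.strptime(f["date"], "%Y-%m-%d") for f in filings]
    newest = max(dates)
    cutoff = newest - timedelta(days=window_days)
    # A burst of filings in a short window hints at coordinated insider activity
    return sum(d >= cutoff for d in dates)
```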

Step 5: Rate Limiting and Production Patterns

The SEC enforces a 10 req/s rate limit. For production workflows scanning hundreds of tickers, throttle requests client-side and retry transient failures with exponential backoff. A minimal rate-limiting decorator:

import time
from functools import wraps

def rate_limited(max_per_second):
    min_interval = 1.0 / max_per_second
    last_called = [0.0]
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait = min_interval - elapsed
            if wait > 0:
                time.sleep(wait)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limited(8)  # Stay under 10/s
def safe_get(url: str) -> dict:
    return requests.get(url, headers=HEADERS).json()
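Rate limiting alone does not handle transient failures such as HTTP 429/503 responses or timeouts. A minimal exponential-backoff wrapper; the fetch function is injected so the retry logic stays testable, and the names here are illustrative:

```python
import time

def get_with_backoff(fetch, url: str, retries: int = 4, base_delay: float = 1.0):
    """Call fetch(url), retrying failures with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            # Delay doubles each attempt: base, 2x base, 4x base, ...
            time.sleep(base_delay * (2 ** attempt))
```

For example, get_with_backoff(safe_get, url) combines the throttling above with retries.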

What to Do With This Data

Raw EDGAR data is only valuable when transformed into factors. Once you have the extraction pipeline running, the next step is building a scoring model. For most RIAs, three signals drive the majority of the alpha: insider cluster buying from Form 4 filings, quarter-over-quarter position changes from 13F filings, and fundamental trends such as revenue growth from XBRL data.
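As an illustration of the scoring step, a toy equal-style composite over normalized signal values. The signal names and weights here are hypothetical, not any vendor's actual model:

```python
def composite_score(signals: dict, weights: dict) -> float:
    """Weighted sum of normalized signal values; missing signals score 0."""
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())

# Hypothetical weighting across the three signal families
weights = {"insider_cluster": 0.4, "inst_flow": 0.3, "rev_growth": 0.3}
```

In practice each input would be cross-sectionally normalized (e.g. a z-score across the watchlist) before weighting.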

Quantscope runs this entire pipeline automatically — EDGAR ingestion, XBRL parsing, insider scoring, and daily delivery to your inbox. If you want the output without the Python overhead, sign up below for a free daily brief.
