Written by

Beau Rothrock
Senior Data Engineer

Everyone Is a Data Analyst Now

How AngelList decomposed its data stack into knowledge and skills that anyone can use.

Jun 10, 2026 — 15 min read

Last week, someone on our team asked in Slack: "@Dana can you total ICV by product and by year starting from 2019?" Within three minutes, Dana (our AI data analyst, powered by Devin) had queried Snowflake and returned a breakdown of Investment Capital Volume across every product line. The requester then asked two follow-up questions: whether certain product lines were double-counted, and whether the numbers overlapped between two of our platforms. The whole conversation took about ten minutes. No ticket filed, no context switching, no waiting for a human analyst to free up.

That same day, Dana also answered "what is the revenue we make from crypto funds?", exported a breakdown of entity types across all active funds to CSV, and built a dashboard in Metabase with pivot tables broken down by fund type.

Nobody taught these people SQL. They just asked a question in plain English.

This is what the modern data stack looks like at AngelList: a pipeline designed not just to move data, but to make it legible to anyone who can write a Slack message.

The pipeline

AngelList administers $200B+ in assets on platform across 25k+ investment vehicles with a team of fewer than 50 engineers. [1]

The entire data pipeline is built and maintained by three people: myself, Ryan Kelly, and Hiren Pasad. Ryan and Hiren author and maintain the business-specific mart models that power reporting across the company. The data that describes those vehicles, the money flowing through them, and the people involved comes from everywhere: MySQL and PostgreSQL databases backing our Rails and Next.js applications, third-party SaaS tools, payment processors, regulatory filings, and even newsletter platforms.

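Here is how that data moves from source to answer: source systems feed ingestion (Fivetran and Airflow), ingestion loads Snowflake, dbt transforms the raw data into marts, and the consumption layer (Metabase, Streamlit, and AI agents) turns those marts into answers.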

Each stage does one job well.

Sources and ingestion

Data enters through two doors. Fivetran handles the managed connectors: Salesforce, Stripe, Notion, Outreach, and a growing list of others. For sources where we need custom logic, such as Front's API, SEC EDGAR filings, and Stripe's asynchronous reporting endpoints, we run Airflow DAGs from our data-orchestration repo, hosted on Astronomer. These DAGs extract data via API, stage it in S3, and load it into Snowflake using idempotent upsert patterns.
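To make the idempotent upsert idea concrete, here is a minimal sketch of how a load step might generate a Snowflake MERGE statement from a staging table. The function name, table names, and column names are hypothetical and are not taken from the data-orchestration repo. The point is only that a keyed MERGE updates matching rows in place instead of inserting duplicates.

```python
from __future__ import annotations


def build_merge_sql(
    target: str,
    stage: str,
    key_cols: list[str],
    value_cols: list[str],
) -> str:
    """Build a Snowflake MERGE that upserts staged rows into a target table.

    All identifiers are supplied by the caller; the names used in the
    example below are illustrative assumptions.
    """
    # Rows match when every key column is equal.
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    # Matched rows are overwritten with the staged values.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in value_cols)
    # Unmatched rows are inserted with keys and values together.
    all_cols = key_cols + value_cols
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {stage} s\n"
        f"  ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


if __name__ == "__main__":
    # Hypothetical table and column names for a Front conversations load.
    print(
        build_merge_sql(
            target="raw.front.conversations",
            stage="raw.front.conversations_stage",
            key_cols=["id"],
            value_cols=["status", "updated_at"],
        )
    )
```

Running the generated statement twice against the same staged batch should leave the target unchanged after the first run, provided the batch holds at most one row per key.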

Between Fivetran and Airflow, we ingest from 38 distinct source systems into Snowflake. That includes 8 internal production databases and third-party services (Salesforce, Stripe, Front, SEC EDGAR, Segment, Customer.io, Google Sheets, DynamoDB, and others).

Transformation with dbt

Once data lands in Snowflake, dbt transforms it through a layered architecture:

  • Sources define raw table contracts in YAML.
  • Staging models (755 SQL files across 38 source folders) clean and rename columns, one model per source table.
  • Intermediate models handle complex joins, entity resolution, and shared business logic.
  • Mart models (1,402 SQL files across 59 domains) produce the final business-facing tables that people actually query.

The mart layer covers everything: finance, operations, product analytics, sales pipeline, compliance, intelligence reporting, fund performance, and more. In total, the catalog documents 27,185 columns across those 1,402 models, with an additional 11,910 columns enriched from Snowflake's information schema.

We recently migrated from dbt Core to dbt Fusion for local CLI operations. Fusion is faster for parse and compile, which matters when you have 2,248 SQL files and want tight feedback loops during development.

The consumption layer: where humans (and agents) meet data

The traditional modern data stack story ends at "and then analysts write SQL in a BI tool." Ours doesn't.

Metabase serves as our primary BI tool. Teams across AngelList use it for dashboards, saved questions, and ad-hoc exploration. The Metabase instance connects to Snowflake's analytics database, so every mart model is queryable through a familiar interface.

Streamlit in Snowflake handles cases where we need more interactivity than Metabase provides: custom visualizations, complex filters, or dashboards that require Python logic. These apps run directly in Snowflake, query data through Snowpark sessions, and are deployed via the Snowflake CLI.

But the real shift happened when we connected AI agents to the same data layer.

The agent that changed the game

Dana is our AI data analyst. When someone mentions @Dana in our #ask-data Slack channel, it triggers a Devin session that follows a structured skill called query-warehouse. Here is what happens under the hood:

  1. Table discovery. The agent reads a 13,000-line model catalog (auto-generated from the dbt manifest) to find the right table. The catalog includes model names, descriptions, schemas, upstream dependencies, and column counts.
  2. Query construction. The agent consults a knowledge file called analytics_knowledge.md for the SQL dialect rules (Snowflake), the table selection gotchas (e.g., "use product.campaigns for campaign data, not staging_angellist.angellist_fundraising_campaigns_mb"), and the metric definitions from our business glossary.
  3. Query execution. The agent runs the query through a Metabase MCP integration, gets the results, and generates a playground link so the requester can modify the query themselves.
  4. Answer presentation. The agent returns the result with the SQL, a chart if helpful, and the Metabase playground link.
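The table discovery step can be pictured with a toy example. In practice the agent reads the catalog as text and reasons over it, so the keyword-overlap scoring below is a deliberately crude stand-in, not how Dana works. The CatalogEntry shape, the descriptions, and the finance schema name are assumptions for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    schema: str
    name: str
    description: str

    @property
    def qualified_name(self) -> str:
        return f"{self.schema}.{self.name}"


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_tables(
    question: str, catalog: list[CatalogEntry], top_n: int = 3
) -> list[CatalogEntry]:
    """Rank catalog entries by how many question tokens they share."""
    question_tokens = tokenize(question)
    scored = []
    for entry in catalog:
        entry_tokens = tokenize(f"{entry.name} {entry.description}")
        score = len(question_tokens & entry_tokens)
        if score:  # drop entries with no overlap at all
            scored.append((score, entry))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1].qualified_name))
    return [entry for _, entry in scored[:top_n]]


if __name__ == "__main__":
    # Hypothetical catalog entries; descriptions are invented for the demo.
    catalog = [
        CatalogEntry(
            "product",
            "campaigns",
            "One row per fundraising campaign, with ICV and carry bookings",
        ),
        CatalogEntry(
            "finance",
            "fct_comptroller_distributions",
            "Executed distributions with amounts and dates",
        ),
    ]
    for hit in find_tables("What is total ICV per campaign?", catalog):
        print(hit.qualified_name)
```

A real implementation would at least need stop-word handling and synonyms from the business glossary, which is part of why a language model reading a well-described catalog tends to do better than string matching.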

Dana has handled over 565 sessions so far. The range of questions is remarkable. In just the past two weeks:

  • "What is the count of unique startups that have launched rollups where geo is not India?" (Answer: a filtered count with follow-up clarification on what "launched" means)
  • "Give me a list of entity types across all active funds" (Answer: exported to CSV with breakdowns by fund structure)
  • "Help me make a dashboard for the funds division" (Answer: a complete Metabase dashboard with ICV/ARR time series and incident tracking)
  • "What are the Form D filings over $25M?" (Answer: pulled directly from the SEC EDGAR API, since that data wasn't in Snowflake yet)
  • "Can you total ICV by product and by year starting from 2019?" (Answer: full breakdown with follow-up clarification about product overlap)

Some of these are simple lookups. Others involve multi-step reasoning, follow-up questions from the requester, or creating entirely new Metabase dashboards with multiple cards. Dana handles all of them.

The key insight is that Dana isn't magic. It's structured. It follows a documented procedure, consults a curated knowledge base, and uses the same tables and SQL dialect that a human analyst would. The difference is that it's available to anyone in the company, at any hour, without needing to know which table to query or how Snowflake's IFF() function works.

The decomposition: knowledge and skills

What makes this work isn't the AI model. It's the decomposition of institutional knowledge into two primitives: knowledge and skills.

Knowledge is reference material. It's the business glossary that defines what "customer" means at AngelList (it's ambiguous: do you mean GPs or LPs?), what "ICV" stands for (Investment Capital Volume), or how "bookings" are calculated. It's the analytics knowledge file that maps staging folders to source repos. It's the model catalog that tells you product.campaigns has 80+ columns covering everything from IS_AL_ADVISED to CARRY_BOOKINGS.

Dana also builds its own knowledge over time. When it answers a question about a new data domain (say, SEC Form D filings or Treasury crypto deposits), it captures what it learned as a knowledge note so future sessions don't start from scratch. This is how institutional knowledge compounds instead of evaporating after each conversation.

Skills are procedural. They are step-by-step instructions for completing a specific task. The query-warehouse skill tells the agent exactly how to find a table, write a query, execute it, and present the result. The add-dbt-model skill walks through creating a new staging or mart model with proper SQL, YAML schema, and test configuration. The create-streamlit-app skill handles everything from data discovery to Snowflake deployment. There are also skills for managing Metabase cards, dashboards, and collections via MCP.

The data-tooling repo currently defines 11 skills and multiple knowledge files. Together, they form a decision tree: an inbound question routes through AGENTS.md to the right skill, which consults the right knowledge files, and produces a result using the right integrations (Metabase MCP, Snowflake, dbt).
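One way to picture that decision tree is as a small routing table. The real routing is written in prose in AGENTS.md and interpreted by the agent, so the sketch below is only an analogy. The skill names and analytics_knowledge.md come from the description above; the trigger phrases and the other knowledge file names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    triggers: tuple[str, ...]         # hypothetical phrases that select the skill
    knowledge_files: tuple[str, ...]  # reference material the skill consults


# File names other than analytics_knowledge.md are assumptions.
SKILLS = (
    Skill("add-dbt-model", ("dbt model", "new model"), ("model_catalog.md",)),
    Skill("create-streamlit-app", ("streamlit",), ("model_catalog.md",)),
    Skill(
        "query-warehouse",
        (),
        ("analytics_knowledge.md", "business_glossary.md", "model_catalog.md"),
    ),
)


def route(
    request: str,
    skills: tuple[Skill, ...] = SKILLS,
    default: str = "query-warehouse",
) -> Skill:
    """Pick the first skill whose trigger appears in the request.

    Requests that match no trigger fall through to the default skill.
    """
    text = request.lower()
    for skill in skills:
        if any(trigger in text for trigger in skill.triggers):
            return skill
    return next(skill for skill in skills if skill.name == default)


if __name__ == "__main__":
    chosen = route("What's our take rate by vehicle type this quarter?")
    print(chosen.name, chosen.knowledge_files)
```

Keeping the skill and its knowledge files as separate fields mirrors the split described above: the procedure can change without touching the reference material, and the reverse.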

This decomposition is what enables self-service analytics. When someone asks "what's our take rate by vehicle type this quarter?", they don't need to know that the answer lives in product.campaigns, that TAKE_RATE is defined as revenue divided by ICV, or that they should exclude test deals and INR-denominated campaigns. The skill knows. The knowledge defines it. The agent executes it.
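As a worked example of that definition, the sketch below computes take rate as revenue divided by ICV per vehicle type, after dropping test deals and INR-denominated campaigns. The row fields (vehicle_type, revenue, icv, is_test, currency) and the sample numbers are invented for illustration and do not reflect the actual columns of product.campaigns; the quarter filter is omitted for brevity.

```python
from __future__ import annotations

from collections import defaultdict


def take_rate_by_vehicle_type(rows: list[dict]) -> dict[str, float]:
    """Return revenue / ICV per vehicle type, excluding test and INR rows."""
    revenue: dict[str, float] = defaultdict(float)
    icv: dict[str, float] = defaultdict(float)
    for row in rows:
        # Exclusions named in the metric definition.
        if row["is_test"] or row["currency"] == "INR":
            continue
        revenue[row["vehicle_type"]] += row["revenue"]
        icv[row["vehicle_type"]] += row["icv"]
    # Ratio of sums, not an average of per-campaign ratios.
    return {vt: revenue[vt] / icv[vt] for vt in icv if icv[vt] > 0}


if __name__ == "__main__":
    # Hypothetical sample data.
    rows = [
        {"vehicle_type": "SPV", "revenue": 10_000, "icv": 1_000_000,
         "is_test": False, "currency": "USD"},
        {"vehicle_type": "SPV", "revenue": 30_000, "icv": 1_000_000,
         "is_test": False, "currency": "USD"},
        {"vehicle_type": "Fund", "revenue": 50_000, "icv": 5_000_000,
         "is_test": False, "currency": "USD"},
        {"vehicle_type": "SPV", "revenue": 999, "icv": 1,
         "is_test": True, "currency": "USD"},
        {"vehicle_type": "Fund", "revenue": 5_000, "icv": 100_000,
         "is_test": False, "currency": "INR"},
    ]
    for vehicle_type, rate in sorted(take_rate_by_vehicle_type(rows).items()):
        print(f"{vehicle_type}: {rate:.2%}")
```

With these sample rows, SPVs come out at 40,000 / 2,000,000 = 2% and funds at 50,000 / 5,000,000 = 1%; the test deal and the INR campaign contribute nothing. Dividing summed revenue by summed ICV weights large campaigns more heavily than averaging per-campaign rates would, which is one reason a metric like this benefits from a single written definition.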

Keeping knowledge current: the self-updating catalog

Knowledge that goes stale is worse than no knowledge at all. So we automated it.

When anyone pushes a change to a dbt model on the snowflake branch, a GitHub Actions workflow fires. It runs dbtf parse to rebuild the manifest, then executes a Python script (generate_model_catalog.py) that extracts every mart model's metadata into a structured catalog. The workflow commits the updated catalog files back to the repo automatically.
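The contents of generate_model_catalog.py are not shown here, so the following is only a guess at the shape of its first step: pulling mart model metadata out of a parsed manifest. The keys used (nodes, resource_type, original_file_path, depends_on, columns) are standard fields of dbt's manifest.json; the models/marts/ path prefix for identifying mart models and the function name are assumptions.

```python
from __future__ import annotations


def extract_mart_models(
    manifest: dict, mart_prefix: str = "models/marts/"
) -> list[dict]:
    """Collect name, schema, description, upstream deps, and column count
    for every model whose file lives under the (assumed) marts folder."""
    catalog = []
    for node in manifest.get("nodes", {}).values():
        # The manifest also holds tests, seeds, and snapshots; keep models only.
        if node.get("resource_type") != "model":
            continue
        if not node.get("original_file_path", "").startswith(mart_prefix):
            continue
        catalog.append(
            {
                "name": node["name"],
                "schema": node.get("schema", ""),
                "description": node.get("description", ""),
                "depends_on": sorted(node.get("depends_on", {}).get("nodes", [])),
                "column_count": len(node.get("columns", {})),
            }
        )
    # Sort for stable output so automated commits only show real changes.
    return sorted(catalog, key=lambda model: (model["schema"], model["name"]))


if __name__ == "__main__":
    import json
    from pathlib import Path

    # target/manifest.json is dbt's default artifact location.
    manifest_path = Path("target/manifest.json")
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
        for model in extract_mart_models(manifest):
            print(model["schema"], model["name"], model["column_count"])
```

Sorting the output deterministically matters for a workflow that commits generated files back to the repo, because it keeps diffs limited to models that actually changed.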

This means the catalog that Dana consults is always in sync with the actual dbt models. Add a new mart? The catalog updates on merge. Rename a column? The column details file reflects it by the next CI run. No manual wiki maintenance, no stale documentation.

The catalog generator parses the dbt manifest, queries Snowflake's information schema for column metadata that dbt doesn't track (like actual data types from the warehouse), cross-references the two, and produces a three-tier output: an index file, per-domain column detail files for mart models, and a separate file for staging-prefixed mart domains.
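The cross-referencing step might look roughly like the sketch below, which joins dbt column descriptions with rows shaped like Snowflake's INFORMATION_SCHEMA.COLUMNS view (COLUMN_NAME and DATA_TYPE are real columns of that view). The function name and output fields are hypothetical, and the actual generator may resolve mismatches differently.

```python
from __future__ import annotations


def merge_column_metadata(
    dbt_columns: dict[str, str],
    warehouse_columns: list[dict],
) -> list[dict]:
    """Combine dbt descriptions with warehouse data types, column by column.

    dbt_columns maps column name -> description (from the manifest).
    warehouse_columns holds rows with COLUMN_NAME and DATA_TYPE keys.
    """
    # Snowflake stores unquoted identifiers in uppercase, so normalize.
    descriptions = {name.upper(): text for name, text in dbt_columns.items()}
    merged = []
    seen = set()
    for row in warehouse_columns:
        column = row["COLUMN_NAME"].upper()
        seen.add(column)
        merged.append(
            {
                "column": column,
                "data_type": row["DATA_TYPE"],
                "description": descriptions.get(column, ""),
                "documented_in_dbt": column in descriptions,
            }
        )
    # Columns documented in dbt but missing from the warehouse are kept,
    # with no data type, so the mismatch stays visible.
    for column in sorted(set(descriptions) - seen):
        merged.append(
            {
                "column": column,
                "data_type": None,
                "description": descriptions[column],
                "documented_in_dbt": True,
            }
        )
    return merged


if __name__ == "__main__":
    # Hypothetical description; column names are the ones mentioned earlier.
    merged = merge_column_metadata(
        {"is_al_advised": "Whether the campaign is AL advised"},
        [
            {"COLUMN_NAME": "IS_AL_ADVISED", "DATA_TYPE": "BOOLEAN"},
            {"COLUMN_NAME": "CARRY_BOOKINGS", "DATA_TYPE": "NUMBER"},
        ],
    )
    for column in merged:
        print(column)
```

Columns that exist in the warehouse but lack a dbt description come through with an empty description, which makes them easy candidates for the enrichment process.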

We also run a column enrichment process that pulls semantic context from upstream application repos. It reads Rails schema files, Prisma models, and Go SQL definitions to fill in column descriptions that the source systems define but that dbt models don't automatically inherit. This gives the agent richer context when deciding which columns to use in a query.
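For the Rails case, one plausible approach is to read column comments straight out of schema.rb, where Rails writes them as a comment: option on each column line. The sketch below is a regex-based approximation under that assumption; it is not the enrichment code itself, and the sample schema and comment text are invented. Escaped quotes inside a comment are returned as written, without unescaping.

```python
from __future__ import annotations

import re

TABLE_RE = re.compile(r'^\s*create_table\s+"(?P<table>[^"]+)"')
# Matches lines such as: t.string "name", null: false, comment: "Some text"
COLUMN_RE = re.compile(
    r'^\s*t\.\w+\s+"(?P<column>[^"]+)".*?\bcomment:\s*"(?P<comment>(?:[^"\\]|\\.)*)"'
)
END_RE = re.compile(r"^\s*end\b")


def parse_rails_column_comments(schema_rb: str) -> dict[tuple[str, str], str]:
    """Map (table, column) -> comment for every commented column."""
    comments: dict[tuple[str, str], str] = {}
    table = None
    for line in schema_rb.splitlines():
        table_match = TABLE_RE.match(line)
        if table_match:
            table = table_match.group("table")
            continue
        if END_RE.match(line):
            table = None  # left the create_table block
            continue
        if table is None:
            continue
        column_match = COLUMN_RE.match(line)
        if column_match:
            key = (table, column_match.group("column"))
            comments[key] = column_match.group("comment")
    return comments


if __name__ == "__main__":
    # Invented schema fragment for demonstration.
    sample = '''
create_table "campaigns", force: :cascade do |t|
  t.string "name"
  t.boolean "is_al_advised", default: false, null: false, comment: "True when the deal is AL advised"
  t.index ["name"], name: "index_campaigns_on_name"
end
'''
    for (table, column), comment in parse_rails_column_comments(sample).items():
        print(f"{table}.{column}: {comment}")
```

Prisma models and Go SQL definitions would need their own parsers, but the output shape, a mapping from table and column to a description, can stay the same so that the catalog generator consumes all three uniformly.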

What this looks like in practice

Here is a real example from last week.

Someone in #ask-data asked Dana to create a dashboard showing executed distributions broken down by five criteria (distributions with capital calls, distributions with negative amounts, distributions with true-ups, and so on). Dana queried the FCT_COMPTROLLER_DISTRIBUTIONS table, built four of the five Metabase cards, assembled them into a new dashboard, and then asked a clarifying question: "I searched across all comptroller distribution tables and there's no explicit 'true-up' flag. How should I identify those?" It knew to ask rather than guess.

In another session, someone asked for the count of unique startups that launched rollups outside India. Dana returned an initial number. The requester pushed back: "I'm asking about any with investor interest, not just signed or completed." Dana re-ran the query with all non-zero commitment closings regardless of state and returned a higher count. Then a follow-up: "Can you break this down by year?" Done.

This is the pattern. Someone asks a question. The agent finds the table, writes the SQL, runs it, and presents the answer. When the question is ambiguous, it asks for clarification. When the requester wants a different cut, it iterates. When the data isn't in Snowflake (like SEC filings), it finds an alternative source.

What comes next

The most obvious gap right now is that Dana knows things we don't. Every session that resolves an ambiguous table or discovers a new metric definition generates a knowledge note, but those notes live in session state rather than version control. We will build an agent that scans accumulated Dana knowledge on a regular basis and proposes PRs to integrate it back into the canonical corpus, so that what Dana learns compounds into the shared knowledge base rather than evaporating.

On the data quality side, we will build agents that periodically comb the dbt model catalog looking for redundant, ambiguous, or conflicting definitions and propose PRs for the data team to review. At 1,400+ mart models, consistency drift is inevitable; automated review is the only way to catch it at scale.

Finally, we will strengthen the corpus of skills available to Claude directly, with the goal of bringing its analytics abilities closer to parity with Dana. The knowledge-and-skills architecture isn't specific to any one agent; it's a pattern that any model can follow if the documentation is good enough.

Three people maintain the pipeline. Hundreds of people across the company ask questions of it every week. When business analytics nearly always begins with a human-written prompt to an agent, the bottleneck moves from "who has access to the data" to "who is asking the right questions." That's a better bottleneck to have.

Beau Rothrock works on the data platform at AngelList. If building systems that bridge AI, data, and finance sounds interesting, check out our open roles.

Footnotes

  1. "Assets on platform" refers to the amount of money being deployed by fund managers who use AngelList's software, which includes fund administration services. This does not refer to any amount of money being deployed with or managed by Platform Advisor, LLC.
