A team of researchers led by Yikang Li at the Shanghai Innovation Institute, with co-authors from Westlake University, Shanghai Jiao Tong, Harbin Institute of Technology, and Fudan, published a dataset on GitHub that is, to my knowledge, the largest public corpus of prediction-market trading activity ever released. 1.1 billion trading records. 268,706 markets. 107 GB across five parquet files. Every OrderFilled event on Polymarket, cross-verified against Polygon RPC nodes, MIT-licensed, free for commercial and research use.
I spent a few hours this week poking through what's actually in it. I want to write a straight take — where it matters, where it falls short, and what it probably does to the category.
What's actually in there
The release is five parquet files covering complementary views of the same underlying activity:
- orderfilled.parquet — 31 GB of raw on-chain OrderFilled events, 293 million of them.
- trades.parquet — 32 GB of processed trades with market metadata attached.
- markets.parquet — 68 MB of market records, all 268,706 of them, the lightest file in the pack.
- quant.parquet — 21 GB of trades unified to a single YES-token perspective. 170 million records after the authors filter out contract-to-contract flows.
- users.parquet — 23 GB of per-user maker/taker behavior. 340 million user rows.
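Files this size are worth inspecting before you load them. Here is a minimal sketch, assuming `pyarrow` is installed and the five files sit in a local `data/` directory (a hypothetical path). The helper names `describe` and `total_size_gb` are made up for illustration, and the sizes are the approximate figures from the list above. It reads only the parquet footer, so it returns row counts and column names without pulling tens of gigabytes into memory.

```python
from pathlib import Path

# Approximate sizes in GB, taken from the file list above.
FILES_GB = {
    "orderfilled.parquet": 31.0,
    "trades.parquet": 32.0,
    "markets.parquet": 0.068,
    "quant.parquet": 21.0,
    "users.parquet": 23.0,
}


def total_size_gb(files=FILES_GB):
    """Sum the listed sizes as a sanity check against the headline figure."""
    return sum(files.values())


def describe(path):
    """Return (row_count, column_names) using only the parquet footer."""
    import pyarrow.parquet as pq  # third-party; imported lazily

    meta = pq.read_metadata(path)
    schema = pq.read_schema(path)
    return meta.num_rows, schema.names


if __name__ == "__main__":
    data_dir = Path("data")  # hypothetical download location
    print(f"listed total: {total_size_gb():.1f} GB")
    for name in FILES_GB:
        path = data_dir / name
        if path.exists():
            rows, cols = describe(path)
            print(name, rows, cols)
```

Checking the real column names this way comes first, since nothing in this post should be read as a statement of the actual schema.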
The collection toolkit is open-source. The authors claim a daily update pipeline and no missing blocks, cross-checked against two official exchange contracts on Polygon. That reconciliation claim is unusual at this scale. Historically the hardest part of working with on-chain data has not been collection; it has been trusting that the collection didn't drop something.
Why this is genuinely a big deal
A category does not really have academic legibility until somebody publishes the dataset. Equities got CRSP in 1960, and sixty years later most of quantitative finance still runs on it. Crypto got Santiment, Glassnode, and Kaiko over the 2018-2020 stretch, and that unlocked the entire wave of systematic crypto research. Prediction markets, until a few weeks ago, had none of this. You could scrape Polymarket's API, but you couldn't backtest a strategy that required every trade in the venue's history without writing your own ingestion pipeline and eating the egress bill.
The SII team just did that for everybody. MIT license. You can download it tonight, open a notebook, and know more about how prediction markets actually clear than the average institutional desk does. That's a real unlock for academic research, for model training, and for anybody trying to build a product in this space. It is also, more quietly, a real unlock for the adversarial stuff.
What it unlocks for retail
The charitable read first. If you are a retail trader who actually wants to study prediction markets, this is the best thing that has happened to your workflow in a year.
You can now do things that previously required a market-data subscription or a custom scraping setup: estimate bid-ask spread distributions by market category from the fills, look at how often whale orders get front-run and by how much, study the cross-section of user performance to see what actually separates the top decile, and calibrate your own model against every Fed-decision market Polymarket has ever listed. None of this was accessible before. It is now, for the cost of a large hard drive.
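Of those, calibration is the easiest to sketch. The version below is a minimal one, assuming you have already reduced the data to (YES price, resolved outcome) pairs. How to extract those pairs depends on column names not assumed here, and the function names are hypothetical. It buckets prices into bins and compares the average price in each bin with how often those markets actually resolved YES. It also computes a Brier score, the mean squared error between price and outcome.

```python
from collections import defaultdict


def calibration_table(observations, n_bins=10):
    """Bucket (yes_price, resolved_yes) pairs by price.

    Returns a list of (bin_index, count, mean_price, hit_rate) tuples,
    where hit_rate is the fraction of observations that resolved YES.
    Prices are expected to lie in [0, 1].
    """
    bins = defaultdict(list)
    for price, outcome in observations:
        idx = min(int(price * n_bins), n_bins - 1)  # price 1.0 goes in the top bin
        bins[idx].append((price, outcome))

    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_price = sum(p for p, _ in rows) / len(rows)
        hit_rate = sum(1 for _, o in rows if o) / len(rows)
        table.append((idx, len(rows), mean_price, hit_rate))
    return table


def brier_score(observations):
    """Mean squared error between price and the 0/1 outcome (lower is better)."""
    obs = list(observations)
    return sum((p - (1.0 if o else 0.0)) ** 2 for p, o in obs) / len(obs)
```

In a well-calibrated market, `mean_price` and `hit_rate` track each other across bins. Which price to sample per market (last trade, a time-weighted average, a fixed horizon before resolution) is a design choice the sketch leaves open.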
This is the same kind of leverage shift retail equity traders got when Yahoo Finance started publishing fundamentals for free in the late nineties. The informational edge that used to belong to people who paid two grand a month for a Bloomberg terminal is, in large part, no longer theirs. Turning raw data into alpha still takes work. It just takes less of it than it used to.
Where it doesn't help
I don't want to be blue-sky about this. A few specific things the release doesn't solve:
It's Polymarket only. The dataset covers one venue. The most interesting thing about prediction markets in 2026 is that the same question often trades on Kalshi, Polymarket, and Gemini simultaneously with non-trivial spreads between them. Cross-venue research requires integrating this release with whatever you can scrape off the other two. That is not a dataset anyone has published yet.
It's a daily snapshot, not a real-time feed. The update cadence is "daily." That's fine for retrospective research. It is not useful for real-time trading. If you are looking for latency-sensitive edges, you still need to build your own feed.
The unit of analysis is the trade, not the order book. The 107 GB is built from OrderFilled events, meaning what actually transacted. The release does not include the full resting book state over time. A lot of interesting microstructure questions ("how deep was the book when this big trade hit?") are approximated, not answered, with this release.
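To make "approximated" concrete: one standard way to back out a spread from fills alone is the Roll (1984) estimator, which infers the effective spread from the negative autocovariance of successive price changes. This is a textbook technique swapped in for illustration, not something the dataset's authors provide. The sketch assumes `prices` is a time-ordered list of fill prices for a single market (a hypothetical input you would build yourself), and it relies on Roll's assumptions that buys and sells are equally likely and independent.

```python
import math


def roll_spread(prices):
    """Roll (1984) effective-spread estimate from a time-ordered price series.

    Under Roll's model, cov(dp_t, dp_{t-1}) = -s**2 / 4, so
    s = 2 * sqrt(-cov). Returns None when there are too few trades or
    the sample autocovariance is non-negative (estimator undefined).
    """
    changes = [b - a for a, b in zip(prices, prices[1:])]
    x, y = changes[1:], changes[:-1]  # dp_t paired with dp_{t-1}
    if len(x) < 2:
        return None

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

    if cov >= 0:
        return None
    return 2 * math.sqrt(-cov)
```

The estimate is only as good as the model behind it. Trending or thinly traded markets often produce a non-negative autocovariance, in which case the function returns None. That is the limit the paragraph above describes: fills let you infer a spread, but they do not let you observe the book.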
User-level data is a mixed blessing. users.parquet contains 340 million rows of per-address maker/taker behavior. On a blockchain, none of this was private to begin with — it is all public on Polygon. But consolidating it into a single pre-processed corpus makes targeted analysis of specific addresses dramatically easier. If a Polymarket trader you know has ever been doxxed on Twitter, their complete trading record is now two lines of Python away. That cuts both ways and it is worth being honest about.
Everybody else gets the same dataset. This is the quiet thing about democratized data: it reduces the edge of whoever used to be paying for it, but it does not automatically create edge for the people who just got access. The prop shops running Polymarket strategies have had something close to this internally for a year. Retail getting the same data closes half the gap, maybe. The other half is still the workflow, tooling, and attention-budget difference between somebody with a full-time job and somebody who trades for a living.
What it means for the category
Three things happen when a category crosses the "academic dataset is available" line, and all three are happening to prediction markets right now.
First, the research literature catches up. I expect a substantial uptick in finance and economics papers using this corpus over the next twelve to eighteen months. Calibration studies, microstructure studies, collective-intelligence studies. That research feeds back into how the category gets perceived. "Serious enough for a Journal of Financial Economics paper" is itself a legitimizing signal, and signal begets capital.
Second, the tooling ecosystem gets real. Once anybody can build a backtesting harness on the same underlying data, the bar for new tooling rises fast. You will see specialized analytics products, category-specific dashboards, research interfaces that actually know what they are talking about because they were trained on this corpus. Teams that couldn't ship before because they didn't have the data can now ship.
Third, capital follows legibility. Allocators who wouldn't write a check into "a prediction market strategy" in 2024 because the data couldn't be audited will write one in 2026 because the data can. Not immediately, not in huge sizes, but in a steady trickle. Once one of those strategies is visibly outperforming, the trickle becomes a flow.
None of this is overnight. All three are the direction of travel. This release is the kind of event that gets cited years later as the moment the category became a research area in its own right.
What it means for us
Briefly, because not every post needs to end in a pitch.
Tykhy is a retail interface for prediction markets. This release makes that interface more valuable, not less. The research loop we can build on top of the SII corpus — category-specific calibration, cross-venue spread analysis, peer performance comparisons, an AI research layer grounded in the actual history of how markets cleared — is a better product the more of this data is open. The more research is public, the less our users are paying for data access and the more they are paying for how we turn that data into a workflow that fits in their day.
We'll be integrating against the SII dataset over the next few weeks. When there's something real to show, we'll write about it here.
For now: go download it. Open a notebook. Look at something you were curious about. This is the most interesting artifact the category has produced in months, and the people who know what's in it first are going to have the informational advantage for a while.
— Ilhan