We found some great tools online, but none reliably handled the entire process. We wanted an API that took a URL, crawled the pages under that URL, and gave us easy-to-use, up-to-date markdown we could feed into our index.
So, we released an open-source repo and an API that crawl entire websites and turn them into markdown with just a few lines of code.
The API handles:
- Crawling sites without consistent sitemaps
- Infra for running many crawling jobs
- Proxying and hosting headless browsers at scale
- Conversion to clean markdown
- Caching
- Handling images, videos (soon), and tables (soon)
- LLM extraction (soon)
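To give a feel for the markdown-conversion step, here's a toy sketch of an HTML-to-markdown pass using only the Python standard library. It handles just headings, paragraphs, and links; the real conversion pipeline is far more thorough, and this is only an illustration of the idea.

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Toy converter: turns a small subset of HTML (h1-h3, p, a) into markdown."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # Map heading level to the matching number of '#' characters
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    return "".join(parser.out)

md = to_markdown('<h1>Docs</h1><p>See <a href="https://example.com">here</a>.</p>')
```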
The repo is open source, and the hosted API starts free. It has built-in loaders for both @llama_index and @langchain.
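For a rough idea of what kicking off a crawl job could look like, here's a hypothetical sketch using only the standard library. The endpoint URL, field names, and header shape below are placeholders, not the actual API surface; check the repo for the real client and parameters.

```python
import json
import urllib.request

# Placeholder endpoint -- not the real API URL
API_URL = "https://api.example.com/v1/crawl"

def start_crawl(url: str, api_key: str) -> urllib.request.Request:
    """Build a hypothetical request that would submit a crawl job for `url`."""
    payload = json.dumps({"url": url}).encode()
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = start_crawl("https://docs.example.com", "YOUR_API_KEY")
# urllib.request.urlopen(req) would submit the job; the response would
# carry back clean markdown for each crawled page.
```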
Excited to see people try it