Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker
AWS Big Data
FEBRUARY 1, 2024
Common Crawl data The Common Crawl raw dataset includes three types of data files: raw webpage data (WARC), metadata (WAT), and text extraction (WET). Data collected after 2013 is stored in WARC format and includes corresponding metadata (WAT) and text extraction data (WET).
Let's personalize your content