Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.
As the largest and most diverse collection of information in human history, the web offers tremendous insight if we can learn to understand it better. Web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture, and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who wants to use it.
The Common Crawl corpus contains data from billions of web pages and is available for free through Amazon Web Services' Public Data Sets program. Common Crawl's Hadoop classes and other code can be found on GitHub.
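For readers who want to poke at the corpus directly, here is a minimal sketch (not part of the original post) of browsing the crawl data from Python with boto3. The `commoncrawl` bucket is the project's public S3 bucket; the `crawl-data/` prefix is an assumption about the current layout, so check the Common Crawl site for the exact paths.

```python
# Minimal sketch: list top-level crawl directories in the public
# Common Crawl S3 bucket. The "crawl-data/" prefix is an assumption
# about the bucket layout.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The corpus is public, so an anonymous (unsigned) client works --
# no AWS credentials required.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# With Delimiter="/", S3 returns the immediate "subdirectories"
# under the prefix as CommonPrefixes (one per crawl batch).
resp = s3.list_objects_v2(
    Bucket="commoncrawl", Prefix="crawl-data/", Delimiter="/"
)
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```

From one of the listed crawl directories you would then fetch individual WARC files for processing, for example with the Hadoop classes mentioned above.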
The Common Crawl team is very excited to be part of the dotScale conference! The organizers have put together a program full of substantial content from leading thinkers and attracted hundreds of talented developers. It is sure to be a stimulating two days of presentations and conversations!