Overall, we collected 203,886 online articles published on three platforms between January 23, 2020 and June 22, 2020. Reuters and NYTimes are the websites of the international news companies owned by Thomson Reuters and the New York Times Company, respectively. The topics covered include business, politics, financial markets, science and health. In addition, we collected data from MarketWatch, which focuses purely on financial news and stock market data. The MarketWatch articles contain the most words on average (706) and have the lowest maximum word count (3857). The data collection process consists of three steps. First, we gather the URLs of the online articles either through the API or by web crawling. The Reuters and MarketWatch crawlers are built around a link extractor written in Python with Scrapy. The main goal of web scraping is to extract structured data from unstructured web pages. Scrapy provides the Spider class, which defines how to crawl and parse pages to extract items from a particular site (e.g., by specifying the links to follow). In addition, the Item class provides a container for the scraped data. The API and the crawlers allow us to store metadata such as headline, author, publication date and URL in the database. Second, we filter the COVID-19 URLs by matching related keywords such as “COVID” and “Corona”. In the last step, we collect all text elements (p-tags) from the remaining URLs, i.e. date, title, author and text. Figure 1 depicts the weekly number of collected articles on NYTimes, Reuters and MarketWatch during the course of the pandemic.
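The second and third steps (keyword filtering and p-tag extraction) might look roughly as follows. This is a minimal sketch that uses only the Python standard library rather than Scrapy; the names `KEYWORDS`, `is_covid_related`, `ParagraphExtractor` and `extract_paragraphs`, as well as the record fields `url` and `headline`, are illustrative assumptions and not taken from the paper.

```python
from html.parser import HTMLParser

# Assumed keyword list; the text names "COVID" and "Corona" as examples only.
KEYWORDS = ("covid", "corona")


def is_covid_related(record):
    """Return True if the URL or headline contains a keyword (case-insensitive)."""
    haystack = f"{record.get('url', '')} {record.get('headline', '')}".lower()
    return any(keyword in haystack for keyword in KEYWORDS)


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buffer = []
        self.paragraphs = []

    def _flush(self):
        # Join the buffered fragments and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly closes an unclosed previous one.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline children of <p> (e.g. <b>) is kept as well.
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Plain substring matching is deliberately simple: a keyword such as "corona" would also match unrelated words like "coronation", so a real pipeline would likely need a stricter rule.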
Financial Markets
covid-19 news, sentiment analysis, stock markets
G10, G14, G15