Introduction
One of the easiest ways to get data for any data science project is to use readily available datasets. In recent years, more and more people have been sharing data with others, making data science accessible to everyone: students, researchers, and data scientists from both small and large organizations.
Types of public datasets
When browsing various websites, we can find different types of publicly available data. Some are structured, already cleaned and prepared for use in machine learning tasks, typically in formats like CSV or JSON.
Others are unstructured, such as text, images, videos, or audio, which require proper preprocessing before they can be analyzed.
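As a minimal sketch of what working with the two structured formats looks like, the snippet below parses the same records from CSV and from JSON using only the Python standard library. The sample records and the helper names (`load_csv_records`, `load_json_records`) are hypothetical and exist only for illustration; a real dataset would be read from a downloaded file instead of an in-memory string.

```python
import csv
import io
import json

# Hypothetical sample data standing in for a downloaded dataset file.
CSV_TEXT = "city,year,avg_temp_c\nOslo,2020,7.1\nCairo,2020,23.4\n"
JSON_TEXT = (
    '[{"city": "Oslo", "year": 2020, "avg_temp_c": 7.1},'
    ' {"city": "Cairo", "year": 2020, "avg_temp_c": 23.4}]'
)


def load_csv_records(text):
    """Parse CSV text into a list of dicts; every value is returned as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json_records(text):
    """Parse a JSON array of objects; numbers keep their numeric types."""
    return json.loads(text)


csv_rows = load_csv_records(CSV_TEXT)
json_rows = load_json_records(JSON_TEXT)

# CSV carries no type information, so numeric columns need an explicit conversion.
csv_temps = [float(row["avg_temp_c"]) for row in csv_rows]
json_temps = [row["avg_temp_c"] for row in json_rows]

print(csv_temps == json_temps)
```

Even a "clean" CSV file usually needs this kind of type conversion, whereas JSON preserves the difference between numbers and strings.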
We can also categorize datasets by domain, for example: healthcare, climate, social sciences, or finance. Here are some popular websites to gather data from:
Financial data sources:
• World Bank Data - https://data.worldbank.org/
• Global Financial Data - https://www.globalfinancialdata.com/
• UN Comtrade Database - https://comtrade.un.org/
• National Bureau of Economic Research - https://www.nber.org/
Crime data:
• FBI Crime Data - https://www.fbi.gov/services/cjis/ucr
Biological/chemical databases:
• PDB - https://www.wwpdb.org/
• ChEMBL - https://www.ebi.ac.uk/chembl/
Cybersecurity datasets:
• Cyber Data Scientist - https://cyberdatascientist.com/datasets/
• Secrepo - http://www.secrepo.com/
• Awesome Cybersecurity Datasets - https://github.com/shramos/Awesome-Cybersecurity-Datasets
Climate datasets:
• NOAA Datasets - https://www.ncdc.noaa.gov/cdo-web/datasets
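If you keep your own notes on where to look, grouping sources by domain can be as simple as a dictionary. The sketch below uses a few of the entries listed above; the `CATALOG` structure and the `sources_for` helper are hypothetical names chosen for illustration, not part of any library.

```python
# A small personal catalog of data sources, keyed by domain.
# The entries are taken from the lists above; extend it as needed.
CATALOG = {
    "finance": [
        ("World Bank Data", "https://data.worldbank.org/"),
        ("UN Comtrade Database", "https://comtrade.un.org/"),
    ],
    "biology": [
        ("PDB", "https://www.wwpdb.org/"),
        ("ChEMBL", "https://www.ebi.ac.uk/chembl/"),
    ],
    "climate": [
        ("NOAA Datasets", "https://www.ncdc.noaa.gov/cdo-web/datasets"),
    ],
}


def sources_for(domain):
    """Return the (name, url) pairs for a domain, or an empty list if unknown."""
    return CATALOG.get(domain.strip().lower(), [])


for name, url in sources_for("Finance"):
    print(f"{name}: {url}")
```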
Popular sources
Some websites host a variety of datasets across multiple domains. The most popular include:
General-purpose repositories:
• Kaggle - https://www.kaggle.com/datasets
• UCI Machine Learning Repository - https://archive.ics.uci.edu/datasets
• Google Dataset Search - https://datasetsearch.research.google.com/
• Hugging Face Datasets - https://huggingface.co/docs/datasets/
• KDnuggets - https://www.kdnuggets.com/datasets/index.html
• Awesome Public Datasets Collection - https://github.com/awesomedata/awesome-public-datasets
• Amazon Open Datasets - https://registry.opendata.aws/
Government portals:
• USA - https://data.gov/
• UK - https://data.gov.uk/
• US Census Data - https://www.census.gov/data.html
• EU Open Data Portal - https://data.europa.eu/en
Academic and research repositories:
• OpenML - https://www.openml.org/
• Academic Torrents - https://academictorrents.com
• PhysioNet - https://physionet.org/
• Harvard Dataverse - https://library.harvard.edu/services-tools/harvard-dataverse
How to use them
Before using any dataset from these websites, it is essential to review the licensing information to understand the terms of use. Different datasets are distributed under different licenses, and it's important to ensure that your intended use, especially for commercial or public projects, is permitted.
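One way to make the license review a habit is to record each dataset's license next to its name and run a coarse check before starting work. The sketch below is only an illustration under stated assumptions: the metadata dictionary, the `check_license` helper, and the two license sets are hypothetical, and the identifiers follow the SPDX naming style. It does not replace reading the actual license text, since obligations such as attribution still apply to licenses that pass the check.

```python
# Hypothetical, deliberately short lists of SPDX-style license identifiers.
# Adjust them to your own project's policy.
ALLOWS_COMMERCIAL_USE = {"CC0-1.0", "CC-BY-4.0", "MIT"}
NON_COMMERCIAL_ONLY = {"CC-BY-NC-4.0", "CC-BY-NC-SA-4.0"}


def check_license(metadata, commercial_use):
    """Return a coarse verdict: 'ok', 'blocked: ...', or 'review: ...'."""
    license_id = metadata.get("license")
    if not license_id:
        return "review: no license stated"
    if license_id in NON_COMMERCIAL_ONLY:
        if commercial_use:
            return "blocked: license forbids commercial use"
        return "ok"
    if license_id in ALLOWS_COMMERCIAL_USE:
        return "ok"
    return "review: unrecognized license"


# Hypothetical dataset records, as you might note them down while browsing.
datasets = [
    {"name": "example-a", "license": "CC-BY-4.0"},
    {"name": "example-b", "license": "CC-BY-NC-4.0"},
    {"name": "example-c"},
]

for record in datasets:
    print(record["name"], "->", check_license(record, commercial_use=True))
```

Anything that comes back as "review" means the terms need to be read manually before the data is used.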
Conclusions
Public datasets are a valuable starting point for data science projects. Whether you’re building a portfolio, teaching a class, or conducting advanced research, these resources can accelerate your work. However, always review the terms of use to avoid any legal or ethical issues.