
Publicly available datasets

All about where to find datasets for data science tasks.

Introduction

One of the easiest ways to get data for any data science project is to use readily available datasets. In recent years, more and more people have been sharing data with others, making data science accessible to everyone—students, researchers, and data scientists from both small and large organizations.

Types of public datasets

Publicly available data comes in different forms. Some datasets are structured, and often already cleaned and prepared for machine learning tasks, typically in formats like CSV or JSON.
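As a rough illustration of why structured formats are convenient, here is a minimal sketch that parses the same records from CSV and from JSON using only the Python standard library. The sample data, column names, and helper functions are hypothetical stand-ins for a downloaded file, not taken from any of the sources listed in this post.

```python
import csv
import io
import json

# Hypothetical sample data standing in for a downloaded dataset file.
CSV_TEXT = "city,temperature_c\nOslo,4.5\nCairo,28.0\n"
JSON_TEXT = (
    '[{"city": "Oslo", "temperature_c": 4.5},'
    ' {"city": "Cairo", "temperature_c": 28.0}]'
)


def load_csv_records(text):
    """Parse CSV text into a list of dicts, converting the numeric column."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"city": row["city"], "temperature_c": float(row["temperature_c"])}
        for row in reader
    ]


def load_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


csv_records = load_csv_records(CSV_TEXT)
json_records = load_json_records(JSON_TEXT)
```

Note that CSV stores every value as text, so the numeric column has to be converted explicitly, while JSON keeps numbers as numbers.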

Others are unstructured, such as text, images, videos, or audio, which require proper preprocessing before they can be analyzed.
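For text, "proper preprocessing" can start with something as simple as normalizing case and splitting into tokens. The sketch below shows one possible first step, assuming plain English text; the function names are illustrative, and real projects usually need more (stop-word handling, Unicode normalization, and so on).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_counts(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


# Hypothetical raw input, e.g. one line from a scraped document.
raw = "Hello, World! Hello?"
tokens = tokenize(raw)
counts = word_counts(raw)
```

Once text is reduced to tokens and counts like this, it can be fed into the same tabular tools used for structured data.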

We can also categorize datasets by domain, for example: healthcare, climate, social sciences, or finance. Here are some popular websites to gather data from:

Financial data sources:

• World Bank Data – https://data.worldbank.org/
• Global Financial Data – https://www.globalfinancialdata.com/
• UN Comtrade Database – https://comtrade.un.org/
• National Bureau of Economic Research – https://www.nber.org/

Crime data:

• FBI Crime Data – https://www.fbi.gov/services/cjis/ucr

Biological/chemical databases:

• PDB – https://www.wwpdb.org/
• ChEMBL – https://www.ebi.ac.uk/chembl/

Cybersecurity datasets:

• Cyber Data Scientist – https://cyberdatascientist.com/datasets/
• Secrepo – http://www.secrepo.com/
• Awesome Cybersecurity Datasets – https://github.com/shramos/Awesome-Cybersecurity-Datasets

Climate datasets:

• NOAA Datasets – https://www.ncdc.noaa.gov/cdo-web/datasets

Some websites host a variety of datasets across multiple domains. The most popular include:

General-purpose repositories:

• Kaggle – https://www.kaggle.com/datasets
• UCI Machine Learning Repository – https://archive.ics.uci.edu/datasets
• Google Dataset Search – https://datasetsearch.research.google.com/
• Hugging Face Datasets – https://huggingface.co/docs/datasets/
• KDnuggets – https://www.kdnuggets.com/datasets/index.html
• Awesome Public Datasets Collection – https://github.com/awesomedata/awesome-public-datasets
• Amazon Open Datasets – https://registry.opendata.aws/

Government portals:

• USA – https://data.gov/
• UK – https://data.gov.uk/
• US Census Data – https://www.census.gov/data.html
• EU Open Data Portal – https://data.europa.eu/en

Academic and research repositories:

• OpenML – https://www.openml.org/
• Academic Torrents – https://academictorrents.com
• PhysioNet – https://physionet.org/
• Harvard Dataverse – https://library.harvard.edu/services-tools/harvard-dataverse

How to use them

Before using any dataset from these websites, review its licensing information to understand the terms of use. Datasets are distributed under different licenses: some require attribution, some forbid commercial use, and some restrict redistribution. Make sure your intended use is permitted, especially for commercial or public projects.
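If you work with many datasets, a license check can be made part of your loading code. The sketch below assumes each dataset comes with a metadata dictionary that has a "license" field holding an SPDX-style identifier; that layout, the allow-list, and the function name are assumptions for illustration, and the actual license text remains the authority.

```python
# Illustrative allow-list of license identifiers that permit commercial use.
# Extend or shrink it to match your own project's policy.
COMMERCIAL_OK = {"CC0-1.0", "CC-BY-4.0", "MIT"}


def commercial_use_allowed(metadata):
    """Return True only if the dataset declares a license on the allow-list.

    A missing or unknown license is treated as "not allowed", so the
    dataset is flagged for manual review instead of being silently used.
    """
    license_id = metadata.get("license")
    if license_id is None:
        return False
    return license_id in COMMERCIAL_OK


# Hypothetical metadata records for three datasets.
open_dataset = {"name": "example-open", "license": "CC-BY-4.0"}
noncommercial_dataset = {"name": "example-nc", "license": "CC-BY-NC-4.0"}
unlabeled_dataset = {"name": "example-unknown"}
```

Treating a missing license as a refusal is a deliberate choice here: a dataset without stated terms should be checked by hand before it is used.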

Conclusions

Public datasets are a valuable starting point for data science projects. Whether you’re building a portfolio, teaching a class, or conducting advanced research, these resources can accelerate your work. However, always review the terms of use to avoid any legal or ethical issues.
