THE 10 BEST PLATFORMS TO FIND FREE DATASETS
www.newsdata.io
If “data is the new oil,” then there is a lot of free oil just waiting to be used. And you can do some pretty interesting things with that data, like finding the answer to the question: Is Buffalo, New York really that cold in the winter?

There is plenty of free data out there, ready to be used for school projects, market research, or just for fun. Before you go crazy, however, you should be aware of the quality of the data you find. Here are some great sources of free data and some ways to determine their quality.

All of these dataset sources have strengths, weaknesses, and specialties. All in all, they are great resources, and you can spend a lot of time going down rabbit holes.

But if you want to stay focused and find what you need, it’s important to understand the nuances of each source and use their strengths to your advantage.
Nomics API

Nomics is a cryptocurrency data API focused on price, cryptocurrency market cap, supply, and all-time-high data. They offer candle/OHLC data for currencies and exchanges, and they provide historical aggregate cryptocurrency market caps going back to January 2013. The Nomics API is a resource for all developers and is highly respected in the cryptocurrency industry.

An overall positive experience with Nomics led me to explore what it has to offer. Nomics’ API is pretty straightforward to use, but when I started building crypto apps a few years ago, their API was a bit demanding for me.

If you want historical candlestick data for currencies and exchange rates, raw trade data without gaps, and/or order book data, you will need to pay for these services.

The documentation: https://p.nomics.com/cryptocurrency-bitcoin-api
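As a minimal sketch of what a request looks like, the snippet below pulls current prices for a couple of coins. The endpoint, parameter names, and response fields are based on Nomics’ public documentation and may have changed; the API key is a placeholder.

# Sketch: current prices from the Nomics API (endpoint and fields assumed
# from Nomics' public docs; replace YOUR_API_KEY with a real key).
import requests

API_KEY = "YOUR_API_KEY"
url = "https://api.nomics.com/v1/currencies/ticker"
params = {"key": API_KEY, "ids": "BTC,ETH", "convert": "USD"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()

# The response is a list of dicts, one per coin
for coin in resp.json():
    print(coin.get("id"), coin.get("price"), coin.get("market_cap"))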
1. Google Dataset Search

Google Dataset Search is a free search engine from Google that indexes datasets published across the web, from government portals and academic repositories to platforms like Kaggle. You search by keyword, just as you would with regular Google Search, and the results link you to wherever each dataset is hosted.
2. Kaggle

Kaggle is a popular data science competition website that provides free public datasets you can use to learn more about artificial intelligence (AI) and machine learning (ML).

Organizations use Kaggle to post a prompt (such as cassava leaf disease classification), and teams from around the world compete against each other to solve it using algorithms (and win a cash prize).

Kaggle is quite prominent in the data science community because it provides a way to test and demonstrate your skills: your performance in Kaggle competitions sometimes comes up in job interviews for AI/ML positions.
After these competitions, the datasets are made available for use. At the time of writing, Kaggle has a collection of over 68,000 datasets, which it organizes using a system of tags, usability scores, and positive and negative reviews.

Kaggle has a strong community on its site, with discussion boards within each dataset and each competition. There are also active communities outside of Kaggle, such as r/kaggle, that share tips and tutorials.

All of this is to say that Kaggle is more than just a free dataset distributor; it’s also a way to test your skills as a data scientist. Free datasets are a side benefit that anyone can take advantage of.
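You can also grab those datasets programmatically. The sketch below uses the official kaggle Python package (pip install kaggle) and assumes an API token saved at ~/.kaggle/kaggle.json; the dataset slug is only an illustrative placeholder.

# Sketch: download a public Kaggle dataset with the official "kaggle" package.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# Download and unzip an example dataset into ./data/ (slug is a placeholder)
api.dataset_download_files("zillow/zecon", path="data", unzip=True)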
3. GitHub

GitHub is the global standard for collaborative and open-source online code repositories, and many of the projects it hosts have datasets you can use. There is a specific project for public datasets aptly called Awesome Public Datasets.

Like Kaggle, the datasets available on GitHub are a side benefit of the site’s real purpose; in GitHub’s case, that is primarily a code repository service.

This is not a data repository optimized for discovering datasets, so you might need to get a little creative to find what you’re looking for, and it won’t have the same variety as Google or Kaggle.
4. Government Sources

Many government agencies make their data freely available online, allowing anyone to download and use public datasets. You can find a wide variety of government data from municipal, state, federal, and international sources.

These datasets are great for students and for those focusing on the environment, the economy, healthcare (there is a lot of this kind of data due to COVID-19), or demographics.

Keep in mind that these aren’t the most stylish sites of all time; they are mostly focused on function rather than style.
5. FiveThirtyEight

FiveThirtyEight is a data journalism website that occasionally makes its datasets available. Their original focus was sports, but they have since expanded to pop culture, science, and (most famously) politics.

The datasets made available by FiveThirtyEight are highly organized and specific to their journalistic output. Unlike the other options on this list, you’ll likely end up browsing their inventory rather than searching it.

And you might come across some fun and interesting datasets, like 50 years of World Cup doppelgangers.
6. Data.world

Data.world is a data catalog service that simplifies collaboration on data projects. Most of these projects make their datasets available free of charge.

Anyone can use data.world to create a workspace or a project that hosts a dataset. A wide variety of data is available, but it is not easy to navigate; you will need to know what you are looking for to get results.

Data.world requires a login to access its free community plan, which allows you to create your own projects and datasets and provides access to other people’s. You will need to pay to access more projects, datasets, and repositories.
7. Newsdata.io news datasets

Newsdata.io is a news API that collects worldwide news data on a daily basis and offers that data through its API.

They also provide free news datasets, and the best part is that you can build a news dataset to your own requirements with the help of the Newsdata.io news API in Python, although this may take longer when you are fetching large amounts of data.
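As a rough sketch of that workflow, the snippet below queries the Newsdata.io API and saves a few fields to a CSV. The endpoint and parameter names follow the Newsdata.io documentation; the query, response fields, and output file are example choices to adapt.

# Sketch: build a small news dataset from the Newsdata.io API.
import csv
import requests

URL = "https://newsdata.io/api/1/news"
params = {"apikey": "YOUR_API_KEY", "q": "climate change", "language": "en"}

resp = requests.get(URL, params=params, timeout=30)
resp.raise_for_status()
articles = resp.json().get("results", [])

# Keep a few fields so the response can be reused as a dataset
fields = ["title", "pubDate", "source_id", "link"]
with open("news_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for article in articles:
        writer.writerow({k: article.get(k) for k in fields})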
8. AWS Public Datasets

Amazon makes large datasets available on its Amazon Web Services platform. You can download the data and use it on your own computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works on the AWS site.

Amazon has a page that lists all the datasets to browse. You will need an AWS account, although Amazon does provide a free tier for new accounts that lets you explore the data at no cost.
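Many of the buckets in the Registry of Open Data on AWS can be read anonymously. The sketch below lists a few files with boto3 using unsigned requests; the NOAA GHCN bucket name is an example of a public dataset bucket and should be confirmed in the registry.

# Sketch: list files in a public dataset bucket without AWS credentials.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) S3 client for publicly readable buckets
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="noaa-ghcn-pds", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])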
9. Wikipedia

Wikipedia is a free, online, community-edited encyclopedia. Wikipedia contains an astonishing expanse of knowledge, with pages on everything from the Ottoman-Habsburg wars to Leonard Nimoy.

As part of Wikipedia’s commitment to the advancement of knowledge, they offer all of their content free of charge and regularly generate dumps of all the articles on the site. In addition, Wikipedia offers a history of edits and activity, which allows you to follow the evolution of a page over time and to see who contributes to it.

You can find different ways to download the data on the Wikipedia site, along with scripts for reformatting the data in various ways.
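As a rough sketch, the snippet below streams one of the smaller dump files to disk with requests. The file name and URL layout are assumptions based on the dumps.wikimedia.org structure at the time of writing, so check the dumps index for the file you actually need.

# Sketch: stream-download a Wikipedia dump file (URL is an assumption;
# verify it against the dumps.wikimedia.org index first).
import requests

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml.gz"

with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("enwiki-latest-abstract.xml.gz", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            f.write(chunk)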
10. UCI Machine Learning Repository

The UCI Machine Learning Repository is one of the oldest sources of datasets on the web. While the datasets are user-contributed and therefore have varying levels of documentation and cleanliness, the vast majority are clean and ready to use. UCI is a great first stop when looking for interesting datasets.

The data can be downloaded directly from the UCI Machine Learning Repository without registration. These datasets tend to be quite small and don’t have much nuance, but they are useful for machine learning.
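As a quick sketch, the classic Iris dataset can be loaded straight into pandas from the repository; the URL below follows UCI's long-standing layout but should be confirmed on the dataset's page.

# Sketch: load the Iris dataset directly from UCI, no registration needed.
import pandas as pd

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

iris = pd.read_csv(URL, header=None, names=columns)
print(iris.head())
print(iris["species"].value_counts())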
Quality data gives you quality work

Free data is great, but high-quality free data is better. If you want to do a great job with the data you find, you need to do your due diligence to make sure it’s good-quality data by asking a few questions.
Should I trust the data source?
First, consider the overall reputation of your data source. Ultimately, datasets are created by humans, and those humans may have specific agendas or biases that can carry over into your work.

All of the data sources we have listed here are reliable, but there are plenty of data sources out there that are not. The one caveat with our list is that community-provided collections, such as data.world or GitHub, can vary in quality. If you have doubts about the reputation of your data source, compare it with similar sources on the same topic.
Could the data be incorrect?
Next, examine your dataset for inaccuracies. Again, humans create these datasets, and humans are not perfect. There may be errors in the data that you can quickly identify and correct using a few quick tips.

First tip: estimate a reasonable minimum and maximum for each of your columns, then use filtering and sorting to check whether any values in your dataset fall outside that range.

Let’s say you have a small dataset on used car prices. You would expect the price data to be somewhere between $7,000 and $20,000 or so. When you sort the price column from low to high, the lowest price probably shouldn’t be very far from $7,000.
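If you are working in Python rather than a spreadsheet, the same sanity check takes a few lines of pandas. The file and column names below are hypothetical; adapt them to your own data.

# Sketch: min/max sanity check on a hypothetical used_cars.csv with a "price" column.
import pandas as pd

cars = pd.read_csv("used_cars.csv")

# Do the extremes look plausible?
print(cars["price"].min(), cars["price"].max())

# Flag rows outside the range you'd expect (roughly $7,000 to $20,000)
suspect = cars[(cars["price"] < 7000) | (cars["price"] > 20000)]
print(suspect.sort_values("price"))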
But humans can make mistakes and enter data incorrectly: instead of $11,000.00, someone might type $1,100.00 or $110,000.00. Another common example is that people sometimes don’t want to provide real data for things like phone numbers, so you can get a lot of 9999999999 or 0000000000 entries in these columns.

Also, pay attention to the column headings. A field can be titled “% occupied” and the entries can be 0.80 or 80. Both could mean 80% but would show up differently in the final dataset.
Then check for errors. If they are simple and obvious mistakes, correct them. If an entry is clearly wrong and can’t be fixed, remove it from the dataset so that it doesn’t distort your results.
Could the data be unfinished?

It is very common for a dataset to be missing data. Before you start working with a dataset, it is a good idea to check for null or missing values. If there are a lot of null values, the dataset is incomplete and may not be good to use.

In Excel, you can check this with the COUNTBLANK function; for example, COUNTBLANK(B1:B3) returns 1 if exactly one of those three cells is empty.

Too many null values probably mean an incomplete dataset. If there are some null values, but not too many, you can proceed and replace the null values with 0 using SQL, or do it manually.
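The pandas equivalent of COUNTBLANK is isna(). A quick sketch, assuming a hypothetical dataset.csv:

# Sketch: completeness check on a hypothetical dataset.csv.
import pandas as pd

df = pd.read_csv("dataset.csv")

# Number of missing values per column
print(df.isna().sum())

# Share of missing values per column; values near 1.0 mean a mostly empty column
print(df.isna().mean().round(2))

# If there are only a few missing values, you can fill them with 0
df_filled = df.fillna(0)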
How to know if the data is skewed?

Understanding how your dataset is skewed will help you choose the right data to analyze. It’s helpful to use visualizations to see how skewed your dataset is, as it’s not always obvious just by looking at the numbers.

For numeric columns, use a histogram to see the type of distribution of each column (normal, left-skewed, right-skewed, uniform, bimodal, etc.).

There are no strict recommendations for what to do next based on the dataset, but the way it is skewed will give you a general idea of the quality of the data and suggest which columns to use in the analysis. You can then use this general idea to avoid misrepresenting the data.
For non-numeric columns, use a frequency table to see how many times each value appears. In particular, you might want to check whether one value dominates. If so, your analysis may be limited by the low diversity of values. Again, this is just to give you a general idea of the quality of the data and to indicate which columns are relevant to use.

You can create these visuals and frequency tables from a CSV in Excel or Google Sheets, but you might want to turn to a Business Intelligence (BI) tool for complex datasets.
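Here is a quick way to get the same picture in pandas, assuming a hypothetical dataset.csv with a numeric price column and a non-numeric city column:

# Sketch: skew and frequency checks on a hypothetical dataset.csv.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")

# Numeric column: histogram plus a skewness statistic
df["price"].hist(bins=30)
plt.show()
print(df["price"].skew())  # > 0 right-skewed, < 0 left-skewed, near 0 symmetric

# Non-numeric column: frequency table to see whether one value dominates
print(df["city"].value_counts(normalize=True))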
Use free datasets

Once you have your data and are confident in its quality, it’s time to put it to work. You can go a long way with tools like Excel, Google Sheets, and Google Data Studio, but if you really want to follow best practices in your data career, you need to be familiar with the real deal: a BI platform.

A BI platform will provide powerful data visualization capabilities for any dataset, from small CSVs to large datasets hosted in data warehouses such as Google BigQuery or Amazon Redshift. You can play around with your data to create dashboards and even collaborate with others.