This website will introduce you to the process of using Scikit-Learn, a Python package for machine learning. Learn how this framework can help you access datasets in order to build predictive models and more!
Scikit-Learn uses datasets to represent the input and output of machine learning algorithms. These datasets can be accessed in two ways: from the package itself, or downloaded via the Scikit-Learn API.
Scikit-learn Datasets

Scikit-learn, a Python machine learning toolkit, includes a variety of datasets that can be used to study machine learning and prototype new approaches. If you're new to sklearn, it can be difficult to know which datasets are available, what information each contains, and how to access them. The datasets are well documented in the scikit-learn user guide; here's a short rundown of what's available and how to start using them right away.
Let's start by importing scikit-learn and checking its version. We have sklearn v1.0 here.

import sklearn
sklearn.__version__
# '1.0'
We can retrieve datasets from sklearn using Scikit-learn's "datasets" module. Small "Toy Datasets" are built in, somewhat bigger "Real World" datasets can be downloaded through the scikit-learn API, and simulated datasets generated from random variables for studying numerous machine learning methods are also available.
Let's import the "datasets" module from sklearn.

from sklearn import datasets
Then we can use the dir() method to inspect all of the dataset properties. The names of the datasets included in the datasets package are of particular interest to us.
dir(datasets)
It will return a large list of dataset properties, including all dataset accessor names.
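As a quick sketch, the accessor names can be grouped by their prefix: load_* for bundled toy datasets, fetch_* for downloadable real-world datasets, and make_* for simulated-data generators. (This grouping is a naming convention observed in the module, not an official API.)

```python
from sklearn import datasets

# Group the names in sklearn.datasets by the three accessor prefixes:
# load_* (bundled toy data), fetch_* (downloaded real-world data),
# make_* (generators for simulated data).
accessors = {
    prefix: [name for name in dir(datasets) if name.startswith(prefix)]
    for prefix in ("load_", "fetch_", "make_")
}

for prefix, names in accessors.items():
    print(f"{prefix}* -> {len(names)} accessors")
```

The counts vary between sklearn versions, but each group should be non-empty.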
sklearn: Load Toy Datasets
To view the list of "Toy Datasets" built into the datasets module, we use a list comprehension to filter the names that begin with "load".
[data for data in dir(datasets) if data.startswith("load")]
# ['load_boston', 'load_breast_cancer', 'load_diabetes', 'load_digits',
#  'load_files', 'load_iris', 'load_linnerud', 'load_sample_image',
#  'load_sample_images', 'load_svmlight_file', ...
Each of the examples above is a built-in dataset.
How to Use Scikit-Learn to Load “Toy Datasets”
Let’s look at how to load or access one of the toy datasets now that we have the list of all toy datasets accessible in sklearn.
Let's look at how to use the load_iris() function in the "datasets" module to load the classic iris dataset.
iris = datasets.load_iris()
Each dataset is stored in a dictionary-like format by Scikit-learn. As previously, we can use the dir() method to examine the properties of the iris data set.
dir(iris)
# ['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame',
#  'target', 'target_names']
Because it's a dictionary-like object, we can use either the "dot" operator or square bracket notation to access each of the attributes, such as DESCR, data, and target.
For example, we can use iris.DESCR (or iris['DESCR']) to get the data description.
print(iris.DESCR)
# .. _iris_dataset:
#
# Iris plants dataset
# -------------------
#
# **Data Set Characteristics:**
#
#     :Number of Instances: 150 (50 in each of three classes)
#     :Number of Attributes: 4 numeric, predictive attributes and the class
#     :Attribute Information:
#         - sepal length in cm
#         - sepal width in cm
#         - petal length in cm
#         - petal width in cm
#         - class:
#             - Iris-Setosa
#             - Iris-Versicolour
#             - Iris-Virginica
# ...
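To confirm that the two access styles are interchangeable, here is a short sketch (assuming the container returned by the loader behaves like a dict with attribute access, as in recent sklearn versions):

```python
from sklearn import datasets

iris = datasets.load_iris()

# Attribute access and key access resolve to the very same object,
# so there is no copying overhead either way.
print(iris.data is iris['data'])      # True
print(iris.target is iris['target'])  # True
```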
We use iris['data'] to retrieve the data, which returns a 2D NumPy array.

iris['data'][0:5, ]
# array([[5.1, 3.5, 1.4, 0.2],
#        [4.9, 3. , 1.4, 0.2],
#        [4.7, 3.2, 1.3, 0.2],
#        [4.6, 3.1, 1.5, 0.2],
#        [5. , 3.6, 1.4, 0.2]])
We can retrieve the feature names, i.e. the column names of the data, by using iris['feature_names'].
iris['feature_names']
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Similarly, we use iris['target'] to retrieve the target labels.
iris['target']
# array([0, 0, 0, ..., 1, 1, 1, ..., 2, 2, 2])  # 150 labels: 50 each of classes 0, 1 and 2
We can also use iris['target_names'] to get the names of the target classes, as illustrated below.
iris['target_names']
# array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
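Assuming sklearn 0.23 or later, the loaders also accept as_frame=True, which returns the same data as pandas objects instead of NumPy arrays; a minimal sketch:

```python
from sklearn import datasets

# With as_frame=True, 'data' becomes a DataFrame and 'frame' holds
# the features plus the target in a single table.
iris = datasets.load_iris(as_frame=True)

print(type(iris.data).__name__)  # DataFrame
print(iris.frame.shape)          # (150, 5): 4 feature columns + 1 target column
```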
sklearn: Load Real World Datasets
Similarly, by filtering for names that begin with "fetch", we can view a list of all the bigger "Real World" datasets available in the datasets module. These are significantly larger datasets, and we can use the scikit-learn API to download them by name.
[data for data in dir(datasets) if data.startswith("fetch")]
# ['fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype',
#  'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces',
#  'fetch_openml', 'fetch_rcv1', ...
How to Load a "Real World" Dataset in scikit-learn
To download the California housing dataset, for example, we call fetch_california_housing(), which returns the data in a dictionary-like structure.
ca_housing = datasets.fetch_california_housing()
Using the dir() method, we can obtain a list of all its attributes.

dir(ca_housing)
# ['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
Again, you can use either "dot" notation or square bracket notation to retrieve the data, which is stored as a NumPy array.
ca_housing['data'][0:3, ]
# array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,  1.02380952e+00,
#          3.22000000e+02,  2.55555556e+00,  3.78800000e+01, -1.22230000e+02],
#        [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,  9.71880492e-01,
#          2.40100000e+03,  2. ...
The column names of the dataset are provided by the attribute "feature_names".
ca_housing['feature_names']
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
#  'Latitude', 'Longitude']

The target values and target names are accessed in the same way.

ca_housing['target']
# array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])

ca_housing['target_names']
# ['MedHouseVal']
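Since 'data' and 'feature_names' line up column for column, any of these dataset objects converts naturally into a pandas DataFrame. A sketch using the bundled diabetes data (fetch_california_housing would work identically, but requires a download):

```python
import pandas as pd
from sklearn import datasets

bunch = datasets.load_diabetes()

# Pair the 2D 'data' array with 'feature_names' as column labels,
# then attach the target as one more column.
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df["target"] = bunch.target

print(df.shape)  # (442, 11): 10 feature columns + 1 target column
```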
sklearn: Simulated Datasets
In addition to toy and real-world datasets, sklearn provides a number of simulated datasets for learning and testing a range of machine learning algorithms. The names of all of these generator functions begin with "make". The following is a list of all the simulated datasets in Scikit-learn.
[data for data in dir(datasets) if data.startswith("make")]
# ['make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles',
#  'make_classification', 'make_friedman1', 'make_friedman2', 'make_friedman3',
#  'make_gaussian_quantiles', 'make_hastie_10_2', ...
How to Generate a Simulated Dataset in scikit-learn
Let's have a look at how to use make_regression(), one of the simulated dataset generators. In this example we create 20 data points with noise and save them as X, Y, and coef.
X, Y, coef = datasets.make_regression(n_samples=20, n_features=1,
                                      n_informative=1, noise=10,
                                      coef=True, random_state=0)
This is how our data looks.
X
# array([[-0.15135721],
#        [ 0.40015721],
#        [ 0.97873798],
#        [-0.85409574],
#        [-0.97727788],
#        [ 0.3130677 ],
#        [-0.10321885],
#        [-0.20515826],
#        [ 0.33367433],
#        [ 1.49407907],
#        [ 0.95008842],
#        [ 0.12167502],
#        [ 1.45427351],
#        [ 1.86755799],
#        [ 0.14404357],
#        [ 0.4105985 ],
#        ...
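Because coef=True also returns the true coefficient used to generate the data, we can sanity-check a fitted model against it. A sketch using LinearRegression (the comparison itself is our addition, not part of the make_regression API):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, Y, coef = make_regression(n_samples=20, n_features=1, n_informative=1,
                             noise=10, coef=True, random_state=0)

# The fitted slope should land near the generating coefficient,
# up to the noise that was added to Y.
model = LinearRegression().fit(X, Y)
print(float(coef), float(model.coef_[0]))
```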
Using fetch_openml() to Get Datasets
Another way to retrieve data is with fetch_openml(). Here's an example of using fetch_openml() to get housing data.
from sklearn.datasets import fetch_openml

housing = fetch_openml(name="house_prices", as_frame=True)
dir(housing)
# ['DESCR', 'categories', 'data', 'details', 'feature_names', 'frame',
#  'target', 'target_names', 'url']
One of the benefits of using fetch_openml() to acquire data is that the data is returned as a pandas DataFrame.
housing['data'].head()
#    Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley  ... MiscVal MoSold  YrSold SaleType SaleCondition
# 0   1        60.0       RL         65.0   8450.0   Pave  None  ...     0.0    2.0  2008.0       WD        Normal
# 1   2        20.0       RL         80.0   9600.0   Pave  None  ...     0.0    5.0  2007.0       WD        Normal
# 2   3        60.0       RL         68.0  11250.0   Pave  None  ...     0.0    9.0  2008.0       WD        Normal
# 3   4        70.0       RL         60.0   9550.0   Pave  None  ...     0.0    2.0  2006.0       WD       Abnorml
# 4   5        60.0       RL         84.0  14260.0   Pave  None  ...     0.0   12.0  2008.0       WD        Normal
# 5 rows x 80 columns
Frequently Asked Questions
What datasets are available in Sklearn?
A: sklearn ships with several small built-in "toy" datasets, including iris, digits, diabetes, breast cancer, linnerud, and wine. Larger real-world datasets (e.g. California housing, covertype, LFW faces) can be downloaded with the fetch_* functions, and the make_* functions generate simulated data.
How do I download Sklearn datasets?
A: The toy datasets ship with sklearn itself, so no download is needed. The larger real-world datasets are downloaded and cached automatically the first time you call the corresponding fetch_* function. Many of the classic datasets originate from repositories such as the UCI archive at https://archive.ics.uci.edu/ml/ and OpenML.
How many datasets are in Sklearn?
A: There are seven built-in toy datasets and roughly ten fetchable real-world datasets, plus the make_* generators. Through fetch_openml() you can additionally access the thousands of datasets hosted on OpenML.