The Pandas DataFrame is a labeled data structure in which data is arranged in a tabular fashion in rows and columns. Its format is similar to that of an Excel spreadsheet or a SQL table.
The columns of the DataFrame can be heterogeneous, i.e., different columns can hold different data types. In this chapter, you will learn about various methods for creating DataFrames. You will also learn about selecting data from a DataFrame.
The general syntax for creating a simple DataFrame is similar to that of creating a Pandas Series:
df = pandas.DataFrame(data=data, index=index)
Pandas DataFrame accepts many different kinds of data input, such as a dictionary of 1-D ndarrays, lists, dicts, or Series; a 2-D NumPy ndarray; a structured or record ndarray; or another DataFrame. Along with the data, you can optionally pass index (row labels) and columns (column labels).
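As a minimal sketch of the optional index and columns arguments, the example below builds a DataFrame from a 2-D NumPy ndarray. The array contents and the labels r1, r2, x, y, and z are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array with 2 rows and 3 columns (illustrative values)
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# index supplies the row labels, columns supplies the column labels
df_arr = pd.DataFrame(data=arr, index=['r1', 'r2'], columns=['x', 'y', 'z'])

print(df_arr)
print(df_arr.shape)  # (number of rows, number of columns)
```

If index or columns is left out, Pandas falls back to default integer labels starting at 0.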
How to Create a Pandas DataFrame?
There are numerous ways by which we can create a Pandas DataFrame. Some of the most widely used methods for creating a Pandas DataFrame are discussed below:
1. Creating a Pandas DataFrame from a Python List
To create a Pandas DataFrame from a Python list, create the list and pass it as the data argument to pd.DataFrame().
# Importing pandas module
import pandas as pd
# Initializing a Python List
lst = ['English', 'French', 'Bengali', 'Urdu']
# Creating a Pandas DataFrame
df = pd.DataFrame(lst)
print(df)
If you want to create a Pandas DataFrame with meaningful column names, specify the columns parameter when creating the DataFrame. Similarly, you can assign row labels to the DataFrame by specifying the index parameter.
# Initializing a Python List
lst = ['English', 'French', 'Bengali', 'Urdu']
# Creating a Pandas DataFrame by specifying values of column and index
df = pd.DataFrame(lst, columns=['Language'], index=[1, 2, 3, 4])
df
2. Creating a Pandas DataFrame from multiple lists
We can create a Pandas DataFrame with multiple columns by passing a list of lists. Each inner list becomes one row of the DataFrame, and the columns parameter names the columns.
# Initializing Multiple Lists
lst1 = ['Apple', '1kg', 300]
lst2 = ['Orange', '2kg', 150]
lst3 = ['Mango', '5kg', 800]
# Creating a Pandas DataFrame from multiple lists
df_fruits = pd.DataFrame(data=[lst1, lst2, lst3], columns=['Fruits', 'Quantity', 'Price'])
df_fruits
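In the example above, each list holds one row. If the lists hold columns instead, one possible approach is to combine them with Python's built-in zip() first. The sketch below assumes three hypothetical column-wise lists named fruits, quantities, and prices.

```python
import pandas as pd

# Hypothetical column-wise lists: one list per column, not per row
fruits = ['Apple', 'Orange', 'Mango']
quantities = ['1kg', '2kg', '5kg']
prices = [300, 150, 800]

# zip() pairs up the i-th element of every list to form the i-th row
rows = list(zip(fruits, quantities, prices))

df_cols = pd.DataFrame(rows, columns=['Fruits', 'Quantity', 'Price'])
print(df_cols)
```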
3. Creating a Pandas DataFrame from a dictionary of ndarrays or lists
A Pandas DataFrame can be created from a dictionary whose values are lists or ndarrays of equal length. The keys of the dictionary become the column names of the DataFrame.
# Making necessary imports
import pandas as pd
# Defining a dictionary of ndarrays or lists
d = {'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]}
# Creating a DataFrame from the dictionary
df0 = pd.DataFrame(d)
# Printing the DataFrame and its index
print(df0)
print(df0.index)
# Show a well-formatted DataFrame in Jupyter Notebook or JupyterLab
df0
OUTPUT:
   one  two
0    1    4
1    2    3
2    3    2
3    4    1
RangeIndex(start=0, stop=4, step=1)
4. Creating a Pandas DataFrame from a dictionary of Pandas Series
We can also pass a dictionary whose values are Pandas Series to the pandas.DataFrame() constructor to create a DataFrame.
# Making necessary imports
import pandas as pd
# Defining a dictionary of Pandas Series
dict_data = {'Column_one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
             'Column_two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
# Creating a DataFrame from the dictionary
df1 = pd.DataFrame(dict_data)
# Printing the DataFrame and its index
print(df1)
print(df1.index)
OUTPUT:
   Column_one  Column_two
a         1.0           1
b         2.0           2
c         3.0           3
d         NaN           4
Index(['a', 'b', 'c', 'd'], dtype='object')
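The output shows two effects of building a DataFrame from Series with different indexes: the row labels are the union of both indexes, and a label missing from one Series is filled with NaN. Because NaN is a floating-point value, that column is converted to float. The following sketch checks each of these points; the names s1, s2, and df_aligned are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

df_aligned = pd.DataFrame({'Column_one': s1, 'Column_two': s2})

# Row labels are the union of the two Series indexes
print(list(df_aligned.index))

# Label 'd' has no value in s1, so that cell is NaN
print(df_aligned['Column_one'].isna())

# Column_one becomes float to hold NaN; Column_two stays an integer column
print(df_aligned.dtypes)
```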
5. Creating a Pandas DataFrame from a structured or record array
Structured arrays are ndarrays whose datatype is a composition of simpler datatypes organized as a sequence of named fields. Structured arrays can also be passed to pandas.DataFrame() to create a Pandas DataFrame.
# Making necessary imports
import pandas as pd
import numpy as np
# Defining a structured or record array
# 'S10' is a 10-byte string field (the older 'a10' alias was removed in NumPy 2.0)
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
# Creating a DataFrame
df2 = pd.DataFrame(data)
# Printing the DataFrame
print(df2)
OUTPUT:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
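The b'...' prefix in the output shows that column C holds bytes rather than text. As a minimal sketch, assuming ordinary text values are wanted, the field can be declared with NumPy's Unicode type code 'U10' instead. The names data_text and df_text are illustrative.

```python
import numpy as np
import pandas as pd

# Same structure as before, but field 'C' is a Unicode string of up to 10 characters
data_text = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'U10')])
data_text[:] = [(1, 2., 'Hello'), (2, 3., 'World')]

df_text = pd.DataFrame(data_text)
print(df_text)

# The values in column 'C' are now text, not bytes
print(isinstance(df_text.loc[0, 'C'], str))
```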
How to select data from a DataFrame?
After creating a DataFrame, it is essential to be able to select particular data from it. Pandas provides an easy and efficient way to do this. The methods for selecting particular rows and columns are discussed below.
Selecting columns from a Pandas DataFrame
The syntax for selecting a particular column is:
DataFrameName['column_name']
EXAMPLE:
# Making necessary imports
import pandas as pd
# Defining a DataFrame with three columns namely 'one', 'two' and 'three'
df0 = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1], 'three': [5, 6, 7, 8]})
# Selecting the column 'one'
df0['one']
OUTPUT:
0    1
1    2
2    3
3    4
Name: one, dtype: int64
We can also select multiple columns from the DataFrame by passing a list of column names. In the above example, we can select only the columns ‘one’ and ‘two’ with:
df0[['one', 'two']]
OUTPUT:
   one  two
0    1    4
1    2    3
2    3    2
3    4    1
Note: When only one column is selected, the result is a Pandas Series whereas when multiple columns are selected, the result is a Pandas DataFrame.
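The difference described in the note can be checked directly. The sketch below also shows that passing a one-element list of column names returns a DataFrame with a single column rather than a Series. The variable names df_sel, single, and as_frame are illustrative.

```python
import pandas as pd

df_sel = pd.DataFrame({'one': [1, 2, 3, 4],
                       'two': [4, 3, 2, 1],
                       'three': [5, 6, 7, 8]})

single = df_sel['one']      # a column name on its own gives a Series
as_frame = df_sel[['one']]  # a list of names gives a DataFrame, even with one name

print(type(single).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```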
Selecting rows from a Pandas DataFrame
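Pandas provides the iloc indexer to extract rows (and columns) by integer position. Positions are counted from 0 and are independent of the index labels shown in the DataFrame. Note that iloc is used with square brackets, not parentheses. To understand iloc, let us take the following example: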
# Making necessary imports
import pandas as pd
data = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
        {'a': 100, 'b': 200, 'c': 300, 'd': 400},
        {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(data)
# Printing the DataFrame
print(df)
OUTPUT:
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000
Now, some of the examples of selecting particular data from the DataFrame using iloc are illustrated below:
# Selecting the value at row position 0, column position 1 (the value 2)
print("First:\n", df.iloc[0, 1])
# Selecting column position 1, i.e., the second column of the DataFrame, 'b'
print("\nSecond:\n", df.iloc[:, 1])
# Selecting rows at positions 1 to 2 and columns at positions 1 to 2 (the slice end is excluded)
print("\nThird:\n", df.iloc[1:3, 1:3])
OUTPUT:
First:
 2

Second:
 0       2
1     200
2    2000
Name: b, dtype: int64

Third:
       b     c
1   200   300
2  2000  3000
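In the example above, the default row labels 0, 1, 2 happen to match the positions, which can hide the fact that iloc works purely by position. The following sketch, using made-up row labels 'x', 'y', and 'z', shows that iloc ignores the labels.

```python
import pandas as pd

# Row labels are strings here, so they cannot be confused with positions
df_lab = pd.DataFrame({'a': [1, 100, 1000], 'b': [2, 200, 2000]},
                      index=['x', 'y', 'z'])

first_row = df_lab.iloc[0]   # the row at position 0, whose label is 'x'
last_row = df_lab.iloc[-1]   # negative positions count from the end

print(first_row)
print(last_row.name)         # the label of the selected row
```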
This is how you create DataFrames and select particular rows or columns from them.
In the next chapter, you will learn how to load external data as a Pandas Series or DataFrame.