Python Pandas

What is Pandas?

Pandas is Python’s fundamental data analysis library. It is open source, built on top of NumPy, and provides data structures for working with table-like data. Pandas can handle the following kinds of data and tasks:

  1. Table-like data, for example an SQL table or an Excel spreadsheet
  2. Time series data, ordered or unordered
  3. Arbitrary matrix data
  4. Slicing, indexing and subsetting using labels
  5. Handling missing data (NaN)
  6. Group-by operations
  7. Munging and cleaning data, analysing data, plotting and tabular representation

Pandas vs NumPy

  1. Pandas is built on top of NumPy.
  2. Pandas has high-level data structures (DataFrame). NumPy has low-level data structures (numpy.ndarray).
  3. Pandas is great for handling tabular data and for data alignment, group by, merge and join operations. NumPy is great for mathematical operations on arrays.
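
The difference in data alignment can be seen in a small sketch. The variable names and values here are illustrative assumptions, not taken from the original article: NumPy adds arrays by position, while Pandas matches index labels before adding.

```python
import numpy as np
import pandas as pd

# NumPy adds element by element, purely by position
arr_a = np.array([1, 2, 3])
arr_b = np.array([10, 20, 30])
print(arr_a + arr_b)

# Pandas aligns on index labels first, then adds the values that share a label
ser_a = pd.Series([1, 2, 3], index=["x", "y", "z"])
ser_b = pd.Series([10, 20, 30], index=["z", "y", "x"])
print(ser_a + ser_b)
```

In the Pandas result the value for label x is 1 + 30, because the two x entries are paired even though they sit at different positions.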

Pandas Series and DataFrames

Series and DataFrames are the most important objects in Pandas.

Pandas Series

A Series is a one-dimensional labelled array.

  1. Data is homogeneous: a Series has a single data type (dtype)
  2. The data can be of any type, such as integer, float, string or Python object
  3. Data always has an index
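
These properties can be checked with a short sketch. The sample values are assumptions chosen for illustration. When values of different types are mixed, Pandas picks one dtype that can hold all of them.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])          # all integers
mixed_numbers = pd.Series([1, 2.5])  # integer and float are stored as float
mixed_objects = pd.Series([1, "a"])  # number and string fall back to generic Python objects

print(ints.dtype)
print(mixed_numbers.dtype)
print(mixed_objects.dtype)

# An index is created automatically when none is given
print(ints.index)
```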

Create Pandas Series

There are many ways to create a Pandas Series. Below, we will go through some of them.

  1. From a list
  2. From a dictionary
  3. From numpy.ndarray
  4. From a file

Convert a Python list to Pandas Series

  1. First, we import the pandas library into our programme using import pandas as pd
  2. Then we convert the Python list to a Pandas Series by passing the list to the Series constructor
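
The two steps can be sketched as follows. The original listing is not available, so the variable names are assumptions; the values match the output shown in this section.

```python
import pandas as pd

# Number of days in each month from January to June
days = [31, 28, 31, 30, 31, 30]

# Passing the list to the constructor builds a Series with a default 0..n-1 index
days_series = pd.Series(days)
print(days_series)
```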

The above code will produce this output

0 31
1 28
2 31
3 30
4 31
5 30
dtype: int64

As you can see, an index has been assigned to the data automatically. We can also specify our own index by passing it to the constructor through the index argument.
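
A minimal sketch of a custom index, assuming the same list of days as before and month abbreviations as labels (variable names are illustrative):

```python
import pandas as pd

days = [31, 28, 31, 30, 31, 30]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

# The index argument replaces the default integer labels with our own
days_series = pd.Series(days, index=months)
print(days_series)
```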

 

The above code will produce this output

Jan 31
Feb 28
Mar 31
Apr 30
May 31
Jun 30
dtype: int64

Convert a Python dictionary to Pandas Series
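
When a dictionary is passed to the Series constructor, the keys become the index and the values become the data. A minimal sketch, with an assumed variable name and the values from the output in this section:

```python
import pandas as pd

# Keys become index labels, values become the data
days_by_month = {"Jan": 31, "Feb": 28, "Mar": 31, "Apr": 30, "May": 31, "Jun": 30}

days_series = pd.Series(days_by_month)
print(days_series)
```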

The above code will produce this output

Jan 31
Feb 28
Mar 31
Apr 30
May 31
Jun 30
dtype: int64

Convert a numpy array to Pandas Series
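
A NumPy array can be passed to the Series constructor in the same way as a list. The sketch below assumes the array of odd numbers was built with np.arange, which matches the output shown but may not be how the original listing created it.

```python
import numpy as np
import pandas as pd

# Odd numbers from 1 to 19 (start, stop, step); the stop value is exclusive
odd_numbers = np.arange(1, 20, 2)
print(odd_numbers)

# The array becomes the data of the Series, with a default integer index
odd_series = pd.Series(odd_numbers)
print(odd_series)
```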

The above code will produce this output

[ 1 3 5 7 9 11 13 15 17 19]

0 1
1 3
2 5
3 7
4 9
5 11
6 13
7 15
8 17
9 19
dtype: int64

Vectorised operations on Pandas Series

Just like a NumPy array, a Pandas Series supports vectorised operations. For example, we can add a number to every element of a Series, or multiply or divide every element by a number, without writing a loop.
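
The operations below are a reconstruction that is consistent with the output shown in this section (doubling, adding one, squaring and taking the mean). The original listing is not available, so the exact expressions are an assumption.

```python
import pandas as pd

days_series = pd.Series([31, 28, 31, 30, 31, 30])

print(days_series)         # original values
print(days_series * 2)     # every element multiplied by 2
print(days_series + 1)     # 1 added to every element
print(days_series ** 2)    # every element squared
print(days_series.mean())  # arithmetic mean of all elements
```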

The above code will produce this output

0 31
1 28
2 31
3 30
4 31
5 30
dtype: int64

0 62
1 56
2 62
3 60
4 62
5 60
dtype: int64

0 32
1 29
2 32
3 31
4 32
5 31
dtype: int64

0 961
1 784
2 961
3 900
4 961
5 900
dtype: int64

30.166666666666668

Pandas DataFrames

A DataFrame is a two-dimensional labelled data structure.

  1. Columns can be heterogeneous (different data types), like a spreadsheet or SQL table
  2. Data has both a row index and column labels

Create Pandas DataFrame

There are many ways to create a Pandas DataFrame. Below, we will go through some of them. We can create a Pandas DataFrame:

  1. From a list
  2. From a dictionary
  3. From a Series
  4. From 2D numpy.ndarray
  5. From a file such as text, Excel, CSV, or from a database
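
The first four options can be sketched as follows. All names and sample values are illustrative assumptions rather than code from the original article.

```python
import numpy as np
import pandas as pd

# From a list of rows; column names are supplied separately
df_from_list = pd.DataFrame([["Jan", 31], ["Feb", 28]], columns=["month", "days"])

# From a dictionary: keys become column names, values become the columns
df_from_dict = pd.DataFrame({"month": ["Jan", "Feb"], "days": [31, 28]})

# From a Series: the Series becomes a single column and keeps its index
days_series = pd.Series([31, 28], index=["Jan", "Feb"], name="days")
df_from_series = days_series.to_frame()

# From a 2D NumPy array: each inner array is a row
df_from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

print(df_from_list)
print(df_from_dict)
print(df_from_series)
print(df_from_array)
```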

Import data from file to Pandas DataFrame

Pandas has many functions for reading data from external files. In this example, we use the read_excel function to read data from an external Excel file that shows employee attrition.

The columns attribute will show us all the column names.

The index attribute will show us all the index names.

The values attribute will show us all the values.
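
The original Excel file and listing are not available, so the following is only a sketch under stated assumptions. It writes a tiny, hypothetical sample with three of the attrition column names to a temporary CSV file and reads it back with read_csv, which stands in for read_excel here so that no Excel engine is required. With the real file, pd.read_excel would be called with its path instead, and the column list would be much longer.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows; only the column names come from the attrition data set
sample = pd.DataFrame(
    {
        "Age": [41, 49],
        "Attrition": ["Yes", "No"],
        "Department": ["Sales", "Research & Development"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "attrition_sample.csv")  # hypothetical file name
    sample.to_csv(path, index=False)
    # read_csv is used in place of read_excel; both return a DataFrame
    df = pd.read_csv(path)

print(df.columns)  # column names
print(df.index)    # row labels
print(df.values)   # underlying data as a NumPy array
```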

 

The above code will produce this output

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18',
       'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')