Computer Science, asked by mohi5778, 11 months ago

_ filter data according to specific value or string

Answers

Answered by diyamanglani

Hey here's ur answer

Mark it as brainliest

Answered by arinwal

Answer:

Here is your answer.

In this article, we will cover various methods to filter a pandas dataframe in Python. Data filtering is one of the most frequent data manipulation operations. It is similar to the WHERE clause in SQL, or to the filter in MS Excel that selects specific rows based on some conditions. Python handles filtering and aggregation efficiently through pandas, an excellent package for data wrangling tasks. Pandas is built on top of the numpy package, whose core is written in C, a low-level language. Hence data manipulation with pandas is a fast and convenient way to handle large datasets.

Examples of Data Filtering

It is one of the first steps of data preparation for predictive modeling or any reporting project. It is also called 'Subsetting Data'. See some examples of data filtering below.

Select all the active customers whose accounts were opened after 1st January 2019

Extract details of all the customers who made more than 3 transactions in the last 6 months

Fetch information of employees who spent more than 3 years in the organization and received highest rating in the past 2 years

Analyze complaints data and identify customers who filed more than 5 complaints in the last 1 year

Extract details of metro cities where per capita income is greater than 40K dollars
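
As a rough sketch, the first example in the list could be written as follows. The customers table, its column names and its values are all made up for illustration; they are not part of the original article.

```python
import pandas as pd

# Hypothetical customer table; names and values are invented for this sketch.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "status": ["active", "inactive", "active", "active"],
    "opened_on": pd.to_datetime(
        ["2018-11-30", "2019-03-15", "2019-06-01", "2019-01-01"]
    ),
})

# Active customers whose accounts were opened after 1st January 2019.
cutoff = pd.Timestamp("2019-01-01")
recent_active = customers[
    (customers["status"] == "active") & (customers["opened_on"] > cutoff)
]
print(recent_active)
```

Only customer 103 matches here: 102 is inactive, and 104 was opened on the cutoff date itself, not after it.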

Import Data

Make sure the pandas package is installed before running the following code. You can check by running !pip show pandas in an IPython console. If it is not installed, you can install it with the command !pip install pandas.

We are going to use a dataset containing details of flights departing from NYC in 2013. This dataset has 32735 rows and 16 columns. See the column names below. To import the dataset, we use the read_csv( ) function from the pandas package.

['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time',

'arr_delay', 'carrier', 'tailnum', 'flight', 'origin', 'dest',

'air_time', 'distance', 'hour', 'minute']

import pandas as pd

df = pd.read_csv("https://dyurovsky.github.io/psyc201/data/lab2/nycflights.csv")

Filter pandas dataframe by column value

Select the flight details of JetBlue Airways, which has the two-letter carrier code B6, for flights originating from JFK airport.

Method 1 : DataFrame Way

newdf = df[(df.origin == "JFK") & (df.carrier == "B6")]

newdf.head()

Out[23]:

year month day dep_time ... air_time distance hour minute

7 2013 8 13 1920 ... 48.0 228.0 19.0 20.0

10 2013 6 17 940 ... 50.0 264.0 9.0 40.0

14 2013 10 21 1217 ... 46.0 266.0 12.0 17.0

23 2013 7 7 2310 ... 223.0 1626.0 23.0 10.0

35 2013 4 12 840 ... 186.0 1598.0 8.0 40.0

[5 rows x 16 columns]

The filtered data (after subsetting) is stored in a new dataframe called newdf.

The symbol & refers to the AND condition, which means a row must meet both criteria.

The expression (df.origin == "JFK") & (df.carrier == "B6") returns a series of True / False values: True where both conditions match and False where they do not. This series is then passed to df[ ], which returns all the rows corresponding to True. It returns 4166 rows.
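
To see the True / False series on its own, here is a minimal sketch on a tiny made-up table. It reuses the origin and carrier column names, but the four rows are invented, not taken from the flights data.

```python
import pandas as pd

# Tiny invented sample with the same column names as the flights data.
flights = pd.DataFrame({
    "origin": ["JFK", "LGA", "JFK", "EWR"],
    "carrier": ["B6", "B6", "AA", "UA"],
})

# Each comparison yields one boolean per row; & combines them element-wise.
mask = (flights["origin"] == "JFK") & (flights["carrier"] == "B6")
print(mask.tolist())     # one True/False per row
print(int(mask.sum()))   # number of rows where both conditions hold

# Passing the mask back to the dataframe keeps only the True rows.
subset = flights[mask]
print(subset)
```

The parentheses around each comparison matter: in Python, & binds more tightly than ==, so leaving them out changes the meaning of the expression.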

Method 2 : Query Function

In the pandas package, there are multiple ways to perform filtering. The above code can also be written as shown below. This method is elegant and more readable, and you don't need to repeat the dataframe name every time you specify columns (variables).

newdf = df.query('origin == "JFK" & carrier == "B6"')
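
If the values to match are held in Python variables, query can refer to them with an @ prefix. The sketch below assumes a tiny invented table, and the variable names airport and code are hypothetical.

```python
import pandas as pd

# Tiny invented sample reusing the origin and carrier column names.
flights = pd.DataFrame({
    "origin": ["JFK", "LGA", "JFK", "EWR"],
    "carrier": ["B6", "B6", "AA", "UA"],
})

airport = "JFK"   # hypothetical variable names
code = "B6"

# @name pulls the value from the surrounding Python scope.
result = flights.query("origin == @airport and carrier == @code")
print(result)
```

Inside a query string, and can be used in place of &.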

Method 3 : loc function

loc is short for location. All 3 methods return the same output. They are just different ways of filtering rows.

newdf = df.loc[(df.origin == "JFK") & (df.carrier == "B6")]
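
A quick way to check that the three methods agree is to compare their results with DataFrame.equals. This is only a sketch on an invented four-row table, not on the real flights data.

```python
import pandas as pd

# Invented sample; only the column names mirror the flights data.
flights = pd.DataFrame({
    "origin": ["JFK", "LGA", "JFK", "EWR"],
    "carrier": ["B6", "B6", "AA", "UA"],
    "distance": [228, 264, 1626, 1598],
})

mask = (flights["origin"] == "JFK") & (flights["carrier"] == "B6")

a = flights[mask]                                          # Method 1
b = flights.query('origin == "JFK" & carrier == "B6"')     # Method 2
c = flights.loc[mask]                                      # Method 3

print(a.equals(b), a.equals(c))

# loc can also pick columns in the same step as the row filter.
cols = flights.loc[mask, ["origin", "distance"]]
print(cols)
```

One practical difference is that loc accepts a column selection as its second argument, so rows and columns can be filtered in a single call.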

Filter Pandas Dataframe by Row and Column Position

Suppose you want to select specific rows by their position (let's say from the second through the fifth row). We can use the df.iloc[ ] indexer for this.

Indexing in Python starts from zero. df.iloc[0:5,] refers to the first through fifth rows (the end point, position 5, which is the sixth row, is excluded). df.iloc[0:5,] is equivalent to df.iloc[:5,]

df.iloc[:5,] #First 5 rows

df.iloc[1:5,] #Second to Fifth row

df.iloc[5,0] #Sixth row and 1st column

df.iloc[1:5,0] #Second to Fifth row, first column

df.iloc[1:5,:5] #Second to Fifth row, first 5 columns

df.iloc[2:7,1:3] #Third to Seventh row, 2nd and 3rd column
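
The same slices can be tried on a small table where the results are easy to check by eye. The table below is invented for this sketch; its columns a, b and c are hypothetical.

```python
import pandas as pd

# Invented 10-row table so the positions are easy to follow.
sample = pd.DataFrame({
    "a": range(10, 20),
    "b": range(20, 30),
    "c": range(30, 40),
})

rows_2_to_5 = sample.iloc[1:5]    # positions 1, 2, 3, 4 (end point excluded)
cell = sample.iloc[5, 0]          # sixth row, first column
block = sample.iloc[2:7, 1:3]     # positions 2..6, columns b and c

print(rows_2_to_5.shape)
print(cell)
print(block)
```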

Difference between loc and iloc function

loc selects rows based on index labels, whereas iloc selects rows based on their position in the index, so it only takes integers. Let's create some sample data for illustration.

import numpy as np

x = pd.DataFrame({"col1" : np.arange(1,20,2)}, index=[9,8,7,6,0, 1, 2, 3, 4, 5])

col1

9 1

8 3

7 5

6 7

0 9

1 11

2 13

3 15

4 17

5 19
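
The answer stops at the sample data, so the comparison below is an assumed continuation rather than the original author's code. It applies the same slice, 0:5, with both indexers to the dataframe x defined above.

```python
import numpy as np
import pandas as pd

# Same sample data as above: the index labels are deliberately out of order.
x = pd.DataFrame({"col1": np.arange(1, 20, 2)},
                 index=[9, 8, 7, 6, 0, 1, 2, 3, 4, 5])

# iloc works by position: the first five rows, whatever their labels are.
by_position = x.iloc[0:5]

# loc works by label: from label 0 through label 5, end point included.
by_label = x.loc[0:5]

print(by_position)
print(by_label)
```

With iloc the slice returns the rows labelled 9, 8, 7, 6 and 0, while with loc it returns the six rows labelled 0 through 5, because label-based slices include the end point.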
