### Dataset Documentation

#### Overview

- **Train DataFrame Shape:** (4542343, 41)
- **Test DataFrame Shape:** (1946719, 41)
- **Data Types:**
    - `object`: Categorical or string data
    - `float64`: Continuous numerical data
    - `int8`: Small integers
    - `int16`: Medium-sized integers
    - `int32`: Larger integers
    - `int64`: Large integers
    - `category`: Categorical data with ordered or unordered categories

#### Features

1. **MONTH** (`object`)
    - **Description:** Month of the year when the flight occurred.
    - **Values:** 12 months (e.g., 'January', 'February').
2. **DAY_OF_WEEK** (`object`)
    - **Description:** Day of the week when the flight occurred.
    - **Values:** 7 days of the week (e.g., 'Monday', 'Tuesday').
3. **DISTANCE** (`float64`)
    - **Description:** Distance of the flight in miles.
    - **Values:** Continuous values (e.g., 500.0, 1200.0).
4. **DISTANCE_GROUP_DESCRIPTION** (`int8`)
    - **Description:** Categorized distance group.
        - 1: 'Very Short Distance'
        - 2: 'Short Distance'
        - 3: 'Moderate Distance'
        - 4: 'Moderate to Long Distance'
        - 5: 'Long Distance'
        - 6: 'Very Long Distance'
        - 7: 'Extended Distance'
        - 8: 'Far Distance'
        - 9: 'Distant Location'
        - 10: 'Remote Distance'
        - 11: 'Very Remote Distance'
    - **Values:** Integer categories (1 to 11).
5. **SEGMENT_NUMBER** (`int8`)
    - **Description:** Order of the flight within an aircraft's day. Segment numbers are assigned by departure time (`DEP_TIME`) within each combination of `TAIL_NUM` (aircraft identifier) and `DAY_OF_MONTH` (specific day): the first flight of the day for a given aircraft is 1, the second is 2, and so on.
    - **Values:** Integer values.
6. **CONCURRENT_FLIGHTS** (`int64`)
    - **Description:** Number of flights departing from a given airport (`ORIGIN_AIRPORT_ID`) on a particular day (`DAY_OF_MONTH`) within a departure time block (`DEP_TIME_BLK`), i.e., the total flights operating concurrently at that airport during that time window.
    - **Values:** Integer values.
7. **NUMBER_OF_SEATS** (`int16`)
    - **Description:** Number of seats available on the flight.
    - **Values:** Integer values.
8. **CARRIER_NAME** (`object`)
    - **Description:** Name of the airline carrier (e.g., 'American Airlines', 'Delta').
    - **Values:** Airline names.
9. **AIRPORT_FLIGHTS_MONTH** (`int64`) - Engineered feature, dropped because it is highly correlated with other features
    - **Description:** Total number of flights departing from a specific airport (`ORIGIN_AIRPORT_ID`) during the entire month, regardless of airline or other factors. Useful for analyzing airport traffic, identifying busy airports, and studying airport-specific demand over a monthly period.
    - **Values:** Integer values.
10. **AIRLINE_FLIGHTS_MONTH** (`int64`)
    - **Description:** Total number of flights operated by a specific airline (`OP_UNIQUE_CARRIER`) during the entire month, across all airports and regardless of origin or destination. Useful for analyzing airline activity, comparing airline sizes, and determining which airlines have the most operations over a given period.
    - **Values:** Integer values.
11. **AIRLINE_AIRPORT_FLIGHTS_MONTH** (`int64`)
    - **Description:** Number of flights operated by a specific airline (`OP_UNIQUE_CARRIER`) from a specific airport (`ORIGIN_AIRPORT_ID`) during the month. This combined count per airline and airport pair shows how active each airline is at each airport. Useful for studying the distribution of flights between airlines and airports, identifying airline dominance at particular airports, and understanding airport-airline relationships over a monthly period.
    - **Values:** Integer values.
12. **AVG_MONTHLY_PASS_AIRPORT** (`int64`)
    - **Description:** Average monthly number of passengers at a specific airport. The `monthly_airport_passengers` DataFrame is created by grouping the passengers data by `ORIGIN_AIRPORT_ID` and summing the total passengers (`REV_PAX_ENP_110`) for each airport. This data is merged with `monthly_data`, adding the total passengers for each airport. `AVG_MONTHLY_PASS_AIRPORT` is then calculated by dividing the total passengers by 12.
    - **Values:** Integer values.
13. **AVG_MONTHLY_PASS_AIRLINE** (`int64`) - Engineered feature, dropped because it is highly correlated with other features
    - **Description:** Average monthly number of passengers for a specific airline. The `monthly_airline_passengers` DataFrame is created by grouping the passengers data by airline (`OP_UNIQUE_CARRIER`) and summing the total passengers for each. This data is merged with `monthly_data`, and `AVG_MONTHLY_PASS_AIRLINE` is calculated by dividing the total passengers by 12.
    - **Values:** Integer values.
14. **FLT_ATTENDANTS_PER_PASS** (`float64`)
    - **Description:** Employee statistic: the average number of flight attendants per passenger. Calculated as the ratio of the number of flight attendants involved in passenger handling (`PASSENGER_HANDLING`) to the number of passengers enplaned (`REV_PAX_ENP_110_y`).
    - **Values:** Continuous values.
15. **GROUND_SERV_PER_PASS** (`float64`)
    - **Description:** Employee statistic: the average number of ground service personnel per passenger. Calculated as the ratio of ground service and administrative staff (`PASS_GEN_SVC_ADMIN`) to the number of passengers enplaned (`REV_PAX_ENP_110_y`).
    - **Values:** Continuous values.
16. **PLANE_AGE** (`int32`)
    - **Description:** Age of the plane in years, computed by subtracting `MANUFACTURE_YEAR` from 2019, the year the flight delay data comes from.
    - **Values:** Integer values.
17. **DEPARTING_AIRPORT** (`object`)
    - **Description:** Name of the airport from which the current flight departs.
    - **Values:** Airport names.
18. **LATITUDE** (`float64`)
    - **Description:** Latitude of the airport the current flight departs from, i.e., `DEPARTING_AIRPORT`.
    - **Values:** Continuous values.
19. **LONGITUDE** (`float64`)
    - **Description:** Longitude of the airport the current flight departs from, i.e., `DEPARTING_AIRPORT`.
    - **Values:** Continuous values.
20. **PREVIOUS_AIRPORT** (`object`)
    - **Description:** The airport from which the same aircraft (identified by `TAIL_NUM`) departed on its previous flight that day.
    - **Values:** Airport names.
21. **PRCP** (`float64`)
    - **Description:** Precipitation (in inches) at the airport on the day of the flight.
    - **Values:** Continuous values.
22. **SNOW** (`float64`)
    - **Description:** Snowfall (in inches) at the airport on the day of the flight.
    - **Values:** Continuous values.
23. **SNWD** (`float64`)
    - **Description:** Snow depth (in inches) at the airport on the day of the flight.
    - **Values:** Continuous values.
24. **TMAX** (`float64`)
    - **Description:** Maximum temperature (in degrees Fahrenheit) at the airport on the day of the flight.
    - **Values:** Continuous values.
25. **AWND** (`float64`)
    - **Description:** Average wind speed (in miles per hour) at the airport on the day of the flight.
    - **Values:** Continuous values.
26. **ELAPSED_TIME_DIFF** (`float64`) - Dropped from the final dataset because it causes data leakage
    - **Description:** Difference between actual elapsed time and planned elapsed time.
    - **Values:** Continuous values.
27. **DEP_DELAY** (`float64`) - Dropped from the final dataset because it causes data leakage
    - **Description:** Delay in departure time for a flight.
        - Positive values: departure delay.
        - Negative values: early departure.
        - Zero: the flight departed exactly at the scheduled time.
    - **Values:** Continuous values.
28. **ARR_DELAY** (`float64`) - Dropped from the final dataset because it causes data leakage
    - **Description:** Delay in arrival time for a flight.
        - Positive values: arrival delay.
        - Negative values: early arrival.
        - Zero: the flight arrived exactly at the scheduled time.
    - **Values:** Continuous values.
29. **DELAY_CLASS_NUMERIC** (`int64`)
    - **Description:** Numeric representation of the delay class.
        - 0: On-time departure and arrival.
        - 1: On-time departure, delayed arrival.
        - 2: Delayed departure, on-time arrival.
        - 3: Both departure and arrival delayed.
    - **Values:** Integer categories.
30. **CARRIER_HISTORICAL** (`float64`)
    - **Description:** Historical average delay rate of each carrier per month.
    - **Values:** Continuous values.
31. **DEP_AIRPORT_HIST** (`float64`)
    - **Description:** Historical average delay rate for flights departing from specific airports per month.
    - **Values:** Continuous values.
32. **PREV_AIRPORT_HIST** (`float64`)
    - **Description:** Historical average delay rate for the airport from which the aircraft arrived before the current departure.
    - **Values:** Continuous values.
33. **DAY_HISTORICAL** (`float64`)
    - **Description:** Historical average delay for the specific day of the week, aggregated by month.
    - **Values:** Continuous values.
34. **DEP_BLOCK_HIST** (`float64`)
    - **Description:** Historical average delay for different departure time blocks, aggregated by month.
    - **Values:** Continuous values.
35. **SEASON** (`object`)
    - **Description:** Season of the year when the flight occurred, derived from the month number.
        - Months 12, 1, 2: 'Winter'
        - Months 3, 4, 5: 'Spring'
        - Months 6, 7, 8: 'Summer'
        - Months 9, 10, 11: 'Fall'
    - **Values:** Four seasons ('Winter', 'Spring', 'Summer', 'Fall').
36. **DEP_PART_OF_DAY** (`category`)
    - **Description:** Part of the day when the departure occurred.
        - 'Early Morning & Late Night': 0001-0559
        - 'Morning': 0600-1159
        - 'Afternoon': 1200-1659
        - 'Evening': 1700-1959
        - 'Night': 2000-2359
    - **Values:** Time blocks of the day.
37. **ARR_PART_OF_DAY** (`category`)
    - **Description:** Part of the day when the arrival occurred.
        - 'Early Morning & Late Night': 0001-0559
        - 'Morning': 0600-1159
        - 'Afternoon': 1200-1659
        - 'Evening': 1700-1959
        - 'Night': 2000-2359
    - **Values:** Time blocks of the day.
38. **FLIGHT_DURATION** (`float64`)
    - **Description:** Planned duration of the current flight in minutes. It is derived from two sources:
        - `CRS_ELAPSED_TIME`: The originally scheduled flight duration provided in the dataset.
        - `CALCULATED_DURATION`: A duration calculated as the difference between `CRS_ARR_TIME` (scheduled arrival time) and `CRS_DEP_TIME` (scheduled departure time).
    - The final `FLIGHT_DURATION` is determined by comparing these two values:
        - If the absolute difference between `CRS_ELAPSED_TIME` and `CALCULATED_DURATION` is 5 minutes or less, `CRS_ELAPSED_TIME` is used.
        - If the difference is greater than 5 minutes, `CALCULATED_DURATION` is used.
39. **FLIGHT_DURATION_CATEGORY** (`object`)
    - **Description:** Categorical variable derived from `FLIGHT_DURATION`. It classifies the current flight into one of three categories based on its duration:
        - Short: Flights lasting less than 60 minutes (under 1 hour)
        - Medium: Flights lasting between 60 and 179 minutes (1 to 3 hours)
        - Long: Flights lasting 180 minutes or more (3 hours or more)
40. **PREVIOUS_DURATION** (`float64`)
41. **PREVIOUS_DURATION_CATEGORY** (`object`)
42. **PREVIOUS_ARR_DELAY** (`float64`)
43. **PREVIOUS_DEP_DELAY** (`float64`) - Dropped due to high VIF (variance inflation factor)
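The grouping logic described for SEGMENT_NUMBER and CONCURRENT_FLIGHTS can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the original pipeline: the `flights` DataFrame and its values are hypothetical, and the use of `rank(method="first")` and `transform("size")` is an assumption about one reasonable way to reproduce the described behavior.

```python
import pandas as pd

# Toy data: every value here is invented for illustration only.
flights = pd.DataFrame({
    "TAIL_NUM": ["N1", "N1", "N2", "N1", "N2"],
    "DAY_OF_MONTH": [1, 1, 1, 2, 1],
    "DEP_TIME": [900, 600, 630, 800, 1300],
    "ORIGIN_AIRPORT_ID": [10, 10, 10, 20, 30],
    "DEP_TIME_BLK": ["0900-0959", "0600-0659", "0600-0659",
                     "0800-0859", "1300-1359"],
})

# SEGMENT_NUMBER: order of each flight by departure time within one
# aircraft's day (1 = first flight of the day for that TAIL_NUM).
flights["SEGMENT_NUMBER"] = (
    flights.groupby(["TAIL_NUM", "DAY_OF_MONTH"])["DEP_TIME"]
    .rank(method="first")
    .astype("int8")
)

# CONCURRENT_FLIGHTS: number of departures sharing the same airport,
# day, and departure time block.
flights["CONCURRENT_FLIGHTS"] = (
    flights.groupby(["ORIGIN_AIRPORT_ID", "DAY_OF_MONTH", "DEP_TIME_BLK"])["DEP_TIME"]
    .transform("size")
)

print(flights[["TAIL_NUM", "DAY_OF_MONTH", "DEP_TIME",
               "SEGMENT_NUMBER", "CONCURRENT_FLIGHTS"]])
```

With `method="first"`, two flights of the same aircraft with identical departure times still receive distinct segment numbers, in row order. Whether the original data handles such ties this way is not stated above.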
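The selection rule for FLIGHT_DURATION and the thresholds for FLIGHT_DURATION_CATEGORY can be written out as plain functions. The sketch below is illustrative only; the function names are hypothetical. In particular, `calculated_duration` assumes that scheduled times are `hhmm` integers in a single time zone and that an arrival earlier than the departure means the flight crosses midnight. The description above does not say how time zones or overnight flights were handled.

```python
def hhmm_to_minutes(hhmm: int) -> int:
    """Convert an hhmm clock value (e.g. 1345) to minutes after midnight."""
    return (hhmm // 100) * 60 + hhmm % 100


def calculated_duration(crs_dep_time: int, crs_arr_time: int) -> int:
    """Scheduled arrival minus scheduled departure, in minutes.

    Assumption: both times share a time zone, and a negative difference
    is treated as a flight that lands after midnight.
    """
    diff = hhmm_to_minutes(crs_arr_time) - hhmm_to_minutes(crs_dep_time)
    return diff % (24 * 60)


def flight_duration(crs_elapsed_time: float, calculated: float,
                    tolerance: float = 5.0) -> float:
    """Keep CRS_ELAPSED_TIME when it agrees with the calculated value
    to within `tolerance` minutes; otherwise use the calculated value."""
    if abs(crs_elapsed_time - calculated) <= tolerance:
        return float(crs_elapsed_time)
    return float(calculated)


def flight_duration_category(minutes: float) -> str:
    """Short: under 60 min, Medium: 60 up to (not including) 180, Long: 180+."""
    if minutes < 60:
        return "Short"
    if minutes < 180:
        return "Medium"
    return "Long"


calc = calculated_duration(2330, 45)   # departs 23:30, arrives 00:45
duration = flight_duration(78, calc)   # within 5 minutes, keeps 78
print(calc, duration, flight_duration_category(duration))
```

Treating "60 and 179 minutes" as the half-open interval from 60 up to 180 means a fractional value such as 179.5 falls into Medium rather than into no category.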
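The lookup-style features (SEASON, DEP_PART_OF_DAY / ARR_PART_OF_DAY, and DELAY_CLASS_NUMERIC) can be sketched the same way. The names below are hypothetical. Two points are assumptions rather than facts from the descriptions above: times outside the listed ranges (such as 0000) return `None` because no block is defined for them, and a flight counts as "delayed" when its delay exceeds a `threshold` of 0 minutes, following the "positive values indicate a delay" wording. The original labeling may have used a different cutoff.

```python
from typing import Optional

SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# (first hhmm, last hhmm, label), both ends inclusive, as listed above.
PART_OF_DAY_BINS = [
    (1, 559, "Early Morning & Late Night"),
    (600, 1159, "Morning"),
    (1200, 1659, "Afternoon"),
    (1700, 1959, "Evening"),
    (2000, 2359, "Night"),
]


def season(month_number: int) -> str:
    """Map a month number (1-12) to its season."""
    return SEASON_BY_MONTH[month_number]


def part_of_day(hhmm: int) -> Optional[str]:
    """Map an hhmm time to its block; None if no block covers it."""
    for first, last, label in PART_OF_DAY_BINS:
        if first <= hhmm <= last:
            return label
    return None


def delay_class(dep_delay: float, arr_delay: float,
                threshold: float = 0.0) -> int:
    """0: neither delayed, 1: arrival only, 2: departure only, 3: both.

    Assumption: "delayed" means delay > threshold minutes.
    """
    dep_delayed = dep_delay > threshold
    arr_delayed = arr_delay > threshold
    return 2 * int(dep_delayed) + int(arr_delayed)


print(season(12), part_of_day(1730), delay_class(12.0, -1.0))
```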