Here’s an example of how to load and analyze the “tips” dataset in Python using the pandas and statsmodels libraries:
import pandas as pd
import statsmodels.api as sm
# Load the tips dataset from the seaborn-data repository on GitHub
tips = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
# Fit a linear regression model to predict tip amount based on total bill
X = tips['total_bill']
y = tips['tip']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
# Print the summary of the regression model
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.457
Model:                            OLS   Adj. R-squared:                  0.454
Method:                 Least Squares   F-statistic:                     203.4
Date:                Wed, 22 Feb 2023   Prob (F-statistic):           6.69e-34
Time:                        22:02:24   Log-Likelihood:                -350.54
No. Observations:                 244   AIC:                             705.1
Df Residuals:                     242   BIC:                             712.1
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.9203      0.160      5.761      0.000       0.606       1.235
total_bill     0.1050      0.007     14.260      0.000       0.091       0.120
==============================================================================
Omnibus:                       20.185   Durbin-Watson:                   2.151
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.750
Skew:                           0.443   Prob(JB):                     6.35e-09
Kurtosis:                       4.711   Cond. No.                         53.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In this example, we first load the tips dataset using the pandas read_csv() function. Then, we extract the relevant columns (total_bill and tip) and fit a simple linear regression model by constructing an OLS() model from statsmodels and calling its fit() method. Finally, we print the summary of the model using the summary() method of the fitted results object.
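To make the fitted line concrete, the two coefficients reported in the summary can be plugged into the regression equation tip = const + slope × total_bill. The sketch below copies the values 0.9203 and 0.1050 from the output above; the helper name predict_tip and the $20 bill are illustrative assumptions, not part of the original example.

```python
# Coefficients copied by hand from the summary table above.
INTERCEPT = 0.9203  # "const" row
SLOPE = 0.1050      # "total_bill" row


def predict_tip(total_bill: float) -> float:
    """Return the tip predicted by the fitted line (hypothetical helper)."""
    return INTERCEPT + SLOPE * total_bill


# Predicted tip for an illustrative $20.00 bill.
predicted_tip = predict_tip(20.0)
print(f"{predicted_tip:.2f}")  # prints 3.02
```

In other words, the model predicts roughly 10.5 cents of additional tip for each extra dollar on the bill, on top of a baseline of about 92 cents.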
Note that we also add a constant column to the predictor variable X using the add_constant() function from statsmodels. This is necessary because OLS() does not include an intercept term by default; without it, the fitted line would be forced through the origin.