And we’re back, closing out our series on parsing data in Python using pandas!

We’ll be picking up from where we left off in Part 1. Last time we covered the DataFrame and Series classes, as well as basic selection and indexing. This time we’ll focus on more general forms of selection, hierarchical grouping, numerical aggregation, how to handle missing data, and even plotting!

We’ll continue our strategy of learning by doing—answering questions about our dataset of members of the Congress of the United States.

Question 4: How do gender and party relate?

Previously, we just counted legislators which were equivalent up to their level in some particular categorical variable—in pandas terms, rows which had the same value at one particular column index. We grouped by a column, then aggregated by counting.

Suppose we want to know the gender distribution by political party. This too can be thought of as a group-then-count question.

Before, we grouped by a single column index. Now we want to group by multiple column indices. With this comes a new notion: that of a non-scalar index, or MultiIndex. Instead of passing a column index (here, a scalar str) as an argument to groupby, we pass a list of str:

In [27]: g = df.groupby(['gender', 'party'])

In [28]: type(g)
Out[28]: pandas.core.groupby.DataFrameGroupBy

Once again, we have a GroupBy object. However, let’s look at the keys of the underlying dict, which lives at g.groups:

In [29]: g.groups.keys()
Out[29]: dict_keys([('F', 'R'), ('M', 'D'), ('M', 'I'), ('F', 'D'), ('M', 'R')])

Before, our keys were simply scalars. Now they are tuples (a technical requirement, as general Python lists are not hashable).

We can think of what is happening as grouping by a new “compound” categorical variable. We might called this gender x party, and and its levels would be the Cartesian product of levels of the component variables:

gender party gender x party
F D (F, D)
F I (F, I)
F R (F, R)
M D (M, D)
M I (M, I)
M R (M, R)

Comparing this to the value of g.groups.keys(), we can see that the sets of pairs mostly agree. The only exception is ('F', 'I'), which does not appear in the set of keys. This is only because it doesn’t correspond to any rows, as there aren’t currently any female Independents in Congress.

We’ll try to count rows using the same invocation as before:

In [30]: s = g.apply(len)

In [31]: s
gender  party
F       D         79
        R         29
M       D        157
        I          3
        R        272
dtype: int64

Now we’re presented with a friendly table of our data, hierarchically-organized, and aggregated by counts. How did pandas know what to do? Let’s figure it out by analyzing s.

In [32]: type(s)
Out[32]: pandas.core.series.Series

It is a Series, as before. What does its index look like?

In [33]: s.index
MultiIndex(levels=[['F', 'M'], ['D', 'I', 'R']],
           labels=[[0, 0, 1, 1, 1], [0, 2, 0, 1, 2]],
           names=['gender', 'party'])

As we can see, a MultiIndex is a fairly complex object. All we really need to know is that it encodes the grouping we saw explicitly in the dict keys of the GroupBy object. This gives us some intuitive new indexing mechanisms:

In [34]: s['F']
D    79
R    29
dtype: int64

In [35]: s['F']['D']
Out[35]: 79


However, some aspects are less intuitive:

In [36]: s['F', 'D']
Out[36]: 79

In [37]: s[('F', 'D')]
Out[37]: 79

In [38]: s[['F', 'D']]
gender  party
F       D        79
R        29
dtype: int64

What’s going on here? Due to Python’s parsing, the first two expressions are equivalent. However, a Python list is “array-like”, which means that the indexing attempt gets interpreted as the query: “give me the values of s whose value at the first component of the MultiIndex (i.e. "gender") is the string 'F' or 'D'”.

Since 'D' is not a valid level for 'gender', including it in our query never causes any new rows to match. Thus s[['F', 'D']] ends ups being the same as s[['F']].

This is contrary to what we might’ve meant in this context, namely: “give me the values of s indexed by 'F' in the first component of the MultiIndex and then by 'D' in the second component”. As we saw, that query would be expressed as s['F']['D'].

Reaching in

Another nice feature of this multi-level grouping is that it makes it easy to perform within-group computations.

For example, as we saw last time, we can divide a Series by a scalar to get the result of a corresponding coordinate-wise division. This lets us obtain the ratio of each gender/party combination relative to the overall population of legislators:

In [39]: s / len(df)
gender  party
F       D        0.146296
        R        0.053704
M       D        0.290741
        I        0.005556
        R        0.503704
dtype: float64

But what if we want the ratios within each gender, i.e. relative to the overall count of lawmakers with a given gender? By simply making another, coarser GroupBy object, we can get this in one fell swoop using coordinate-wise array division:

In [40]: count_by_gender_and_party = df.groupby(['gender', 'party']).apply(len)
In [41]: count_by_gender = df.groupby('gender').apply(len)
In [42]: count_by_gender_and_party / count_by_gender
gender  party
F       D        0.731481
        R        0.268519
M       D        0.363426
        I        0.006944
        R        0.629630
dtype: float64

From this, we can easily draw conclusions like “26.9% of women in Congress are Republican”, or “36.3% of men in Congress are Democrat”.

Question 5: What are the descriptive statistics on legislator count by state?

Now that we’ve warmed up a bit and know how to group and count, let’s look at these counts across states. We might care to ask some standard statistical questions. Which states have the fewest legislators? The most? What’s the average number of legislators? The median? What’s the distribution look like?

These questions really only make sense for the House, not the Senate, since there are exactly 2 Senators per state. So, we’ll first select only House legislators from our original DataFrame:

In [43]: house = df[df['chamber'] == 'house']

Then we’ll perform a group-then-count:

In [44]: s = df.groupby('state').apply(len)

This time, instead of working with a “raw” Series, we are going to convert it into a DataFrame. In doing so, we provide a name for the column that contains our new aggregate value:

In [45]: state_count = s.to_frame(name='reps')
AK        3
WI       10
WV        5
WY        3

One thing we can do with the DataFrame is call the describe method on our new column to get some quick summary statistics:

In [46]: state_count.describe()
count  56.000000
mean    9.642857
std     9.645401
min     1.000000
25%     4.000000
50%     7.000000
75%    11.000000
max    55.000000

It’s also easy to get an explicit list of the top 5 states by House representation. We will sort by reps column, reverse the result with the Python slice syntax [::-1], and then use the pandas head method to take the first 5 rows of the result:

In [47]: state_count.sort(columns='reps')[::-1].head()
CA       55
TX       38
FL       29
NY       29
PA       20

Both Series and DataFrame objects respond to the head and tail methods, which optionally accept a number of rows (the default is 5) to return from the start (head) or end (tail) of the array of rows.

So, we also could’ve taken the tail and then reversed:

In [48]: state_count.sort(columns='reps').tail()[::-1]

Or simply stated our preferred ordering as an argument to sort:

In [49]: state_count.sort(columns='reps', ascending=False).head()

In truth, we could’ve done the same thing with the plain old Series. But the DataFrame can be a more convenient object to work with at times.

One example is with plotting. Though we can plot with Series objects, the extra labeling we usually have on a DataFrame makes producing a box plot of our state house count as easy as:

In [50]: boxplot = state_count.plot(kind='box', grid=True)

In [51]: boxplot.figure.savefig('state_count.svg')

And the resulting image visually summarizes the result of describe:

We’ll see another example when answering the next (and last!) question.

Question 6: What are the top 10 states with the highest proportion of female legislators?

This question can be answered with our previous techniques plus a bit of sorting. This time we’ll include senators.

We need to:

  1. Group and count the women legislators by state
  2. Find the proportion of women lawmakers within each state
  3. Sort the resulting rows from greatest to least proportion

We’ve already figured out how to do step 1:

In [52]: women_by_state = df[df['gender'] == 'F'].groupby('state').apply(len)
In [53]: count_by_state = df.groupby('state').apply(len)

And computing step 2 is similar to the last example:

In [54]: gender_proportions_with_missing = (women_by_state / count_by_state).to_frame(name='proportion')

This time, however, we have some NaN values, since when we grouped, some states didn’t correspond to any rows. The pandas fillna method lets us replace NaN values with the appropriate value—in this case, 0:

In [55]: gender_proportions = gender_proportions.fillna(0)

Finally, we sort and take the top 10 states with the head method:

In [56]: sorted_states = gender_proportions.sort(columns='proportion', ascending=False)
In [57]: sorted_states.head(10)
GU       1.000000
AS       1.000000
VI       1.000000
DC       1.000000
NH       0.750000
ME       0.500000
HI       0.500000
WA       0.416667
CA       0.381818
AK       0.333333


There is much, much more to pandas than the simple selecting, sorting, grouping, and plotting we’ve done so far. Furthermore, one of the benefits of using a popular library like pandas is that you can hook into the huge ecosystem built around it. For example, if we were to try to visualize our data set in a more interesting way, I would probably reach for something like the Seaborn library, rather than using the pandas defaults.

Despite everything we haven’t covered, my hope is that after reading this, you now feel like you really understand the central data structures that pandas gives us. With a solid conceptual background, you can dig into the (excellent) official pandas docs.

That’s all in this series for now. In the meantime, happy histogramming!

Main image from Flickr user internetarchivebookimages.