And we’re back, closing out our series on parsing data in Python using pandas!

We’ll be picking up from where we left off in Part 1. Last time we covered the `DataFrame` and `Series` classes, as well as basic selection and indexing. This time we’ll focus on more general forms of selection, hierarchical grouping, numerical aggregation, how to handle missing data, and even plotting!

We’ll continue our strategy of learning by doing—answering questions about our dataset of members of the Congress of the United States.

## Question 4: How do gender and party relate?

Previously, we just counted legislators which were equivalent up to their level in some particular categorical variable—in pandas terms, rows which had the same value at one particular column index. We grouped by a column, then aggregated by counting.

Suppose we want to know the gender distribution by political party. This too can be thought of as a group-then-count question.

Before, we grouped by a single column index. Now we want to group by multiple column indices. With this comes a new notion: that of a non-scalar index, or `MultiIndex`. Instead of passing a column index (here, a scalar `str`) as an argument to `groupby`, we pass a `list` of `str`:

Once again, we have a `GroupBy` object. However, let’s look at the keys of the underlying `dict`, which lives at `g.groups`:

Before, our keys were simply scalars. Now they are tuples (a technical requirement, as general Python `list`s are not hashable).

We can think of what is happening as grouping by a new “compound” categorical variable. We might called this `gender x party`, and and its levels would be the Cartesian product of levels of the component variables:

`gender` `party` `gender x party`
`F` `D` (`F`, `D`)
`F` `I` (`F`, `I`)
`F` `R` (`F`, `R`)
`M` `D` (`M`, `D`)
`M` `I` (`M`, `I`)
`M` `R` (`M`, `R`)

Comparing this to the value of `g.groups.keys()`, we can see that the sets of pairs mostly agree. The only exception is `('F', 'I')`, which does not appear in the set of keys. This is only because it doesn’t correspond to any rows, as there aren’t currently any female Independents in Congress.

We’ll try to count rows using the same invocation as before:

Now we’re presented with a friendly table of our data, hierarchically-organized, and aggregated by counts. How did pandas know what to do? Let’s figure it out by analyzing `s`.

It is a `Series`, as before. What does its `index` look like?

As we can see, a `MultiIndex` is a fairly complex object. All we really need to know is that it encodes the grouping we saw explicitly in the `dict` keys of the `GroupBy` object. This gives us some intuitive new indexing mechanisms:

### Gotchas

However, some aspects are less intuitive:

What’s going on here? Due to Python’s parsing, the first two expressions are equivalent. However, a Python `list` is “array-like”, which means that the indexing attempt gets interpreted as the query: “give me the values of `s` whose value at the first component of the `MultiIndex` (i.e. `"gender"`) is the string `'F'` or `'D'`”.

Since `'D'` is not a valid level for `'gender'`, including it in our query never causes any new rows to match. Thus `s[['F', 'D']]` ends ups being the same as `s[['F']]`.

This is contrary to what we might’ve meant in this context, namely: “give me the values of `s` indexed by `'F'` in the first component of the `MultiIndex` and then by `'D'` in the second component”. As we saw, that query would be expressed as `s['F']['D']`.

### Reaching in

Another nice feature of this multi-level grouping is that it makes it easy to perform within-group computations.

For example, as we saw last time, we can divide a `Series` by a scalar to get the result of a corresponding coordinate-wise division. This lets us obtain the ratio of each gender/party combination relative to the overall population of legislators:

But what if we want the ratios within each gender, i.e. relative to the overall count of lawmakers with a given gender? By simply making another, coarser `GroupBy` object, we can get this in one fell swoop using coordinate-wise array division:

From this, we can easily draw conclusions like “26.9% of women in Congress are Republican”, or “36.3% of men in Congress are Democrat”.

# Question 5: What are the descriptive statistics on legislator count by state?

Now that we’ve warmed up a bit and know how to group and count, let’s look at these counts across states. We might care to ask some standard statistical questions. Which states have the fewest legislators? The most? What’s the average number of legislators? The median? What’s the distribution look like?

These questions really only make sense for the House, not the Senate, since there are exactly 2 Senators per state. So, we’ll first select only House legislators from our original `DataFrame`:

Then we’ll perform a group-then-count:

This time, instead of working with a “raw” `Series`, we are going to convert it into a `DataFrame`. In doing so, we provide a name for the column that contains our new aggregate value:

One thing we can do with the `DataFrame` is call the `describe` method on our new column to get some quick summary statistics:

It’s also easy to get an explicit list of the top 5 states by House representation. We will sort by `reps` column, reverse the result with the Python slice syntax `[::-1]`, and then use the pandas `head` method to take the first 5 rows of the result:

Both `Series` and `DataFrame` objects respond to the `head` and `tail` methods, which optionally accept a number of rows (the default is 5) to return from the start (`head`) or end (`tail`) of the array of rows.

So, we also could’ve taken the `tail` and then reversed:

Or simply stated our preferred ordering as an argument to `sort`:

In truth, we could’ve done the same thing with the plain old `Series`. But the `DataFrame` can be a more convenient object to work with at times.

One example is with plotting. Though we can plot with `Series` objects, the extra labeling we usually have on a `DataFrame` makes producing a box plot of our state house count as easy as:

And the resulting image visually summarizes the result of `describe`:

We’ll see another example when answering the next (and last!) question.

# Question 6: What are the top 10 states with the highest proportion of female legislators?

This question can be answered with our previous techniques plus a bit of sorting. This time we’ll include senators.

We need to:

1. Group and count the women legislators by state
2. Find the proportion of women lawmakers within each state
3. Sort the resulting rows from greatest to least proportion

We’ve already figured out how to do step 1:

And computing step 2 is similar to the last example:

This time, however, we have some `NaN` values, since when we grouped, some states didn’t correspond to any rows. The pandas `fillna` method lets us replace `NaN` values with the appropriate value—in this case, 0:

Finally, we sort and take the top 10 states with the `head` method:

# Wrap-up

There is much, much more to pandas than the simple selecting, sorting, grouping, and plotting we’ve done so far. Furthermore, one of the benefits of using a popular library like pandas is that you can hook into the huge ecosystem built around it. For example, if we were to try to visualize our data set in a more interesting way, I would probably reach for something like the Seaborn library, rather than using the pandas defaults.

Despite everything we haven’t covered, my hope is that after reading this, you now feel like you really understand the central data structures that pandas gives us. With a solid conceptual background, you can dig into the (excellent) official pandas docs.

That’s all in this series for now. In the meantime, happy histogramming!

Main image from Flickr user internetarchivebookimages.