
Data Analysis of USA Today's NFL Arrest Database: 15 Surprising (and Scary) Insights

As soon as I learned that USA Today had released an open database of NFL player arrests (2000 to present), the data scientist in me thought, "I imagine there are some interesting patterns in there." Rather than wondering, I downloaded it and dived right in.

The arrest data is easily readable, but it lacks some important items (such as the age of the player at the time of arrest). As such, I decided to mash up the data with two other sources: DOB, height and weight data from NFL.com, and strength and speed data from the NFL Combine. This would let me explore some of the more interesting (and potentially controversial) claims I heard in many TV interviews about the effect that increases in player size and strength have had on aggression and crime.
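For the curious, here is a minimal sketch of that mash-up in pandas. The file and column names are hypothetical (each source needs its own scraping and clean-up), but the join-and-derive pattern is the heart of it:

```python
# Minimal sketch of the data mash-up; file and column names are hypothetical.
import pandas as pd

arrests = pd.read_csv("usatoday_nfl_arrests.csv")   # one row per arrest
profiles = pd.read_csv("nfl_com_profiles.csv")      # DOB, height, weight
combine = pd.read_csv("nfl_combine_results.csv")    # 40-yd dash, bench press reps

# Join on player name. In practice this needs normalization (suffixes,
# punctuation, players who share a name).
merged = (arrests
          .merge(profiles, on="player_name", how="left")
          .merge(combine, on="player_name", how="left"))

# Derive the age at the time of arrest, which the arrest data lacks.
merged["arrest_date"] = pd.to_datetime(merged["arrest_date"])
merged["dob"] = pd.to_datetime(merged["dob"])
merged["age_at_arrest"] = (merged["arrest_date"] - merged["dob"]).dt.days / 365.25
```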

My findings

Here are my findings from analyzing the data:

  1. Arrest frequency is NOT increasing. It is actually down from a really bad spate between 2006 and 2008.
  2. NFL players, in general, are about one-third less likely to be arrested than everyday US residents. They also have 15x the median US income and 3x the college graduation rate.
  3. However, many of those who are arrested are arrested many times throughout their career. 124 people were arrested more than once. One player was arrested 9 times. Sixty-five arrests were for multiple counts, across multiple criminal charges.
  4. Guilty verdicts (conviction, plea, or plea agreement) are the most common legal outcome. They occur almost 7x more frequently than acquittals.
  5. Nevertheless, the most common action taken by NFL teams in response to an arrest is "No Response." This occurs 84% of the time.
  6. Two-thirds of arrests occur in the off-season. However, over 99% are arrests of players under contract. Free agent arrests are rare (although all of those arrested as free agents later signed with teams).
  7. Three teams (Minnesota, Cincinnati and Denver) have seen double the “normal” number of arrests per team
  8. Four criminal charges (DUI, Drugs, Domestic Violence and Assault) represent 60% of all arrests.
  9. Six charges (DUI, Drugs, Domestic Violence, Assault, Gun Charges and Disorderly Conduct) represent 80% of all arrests. Each of these has a single team with more arrests than any other.
  10. Of the most frequent charges, conviction rates varied enormously. DUIs had the highest conviction rate; Domestic Violence the lowest. While Domestic Violence pleas and convictions outnumbered acquittals 10:1, the vast majority of these cases were dropped or resolved in Diversion Programs.
  11. The median arrested NFL player is 25 years, 6 months old; stands 6’6” tall; weighs 230 lbs.; runs the 40-yd dash in 4.61 seconds; and can bench press 225 lbs. 21 times.
  12. However, age was not a factor in arrest or criminal charge
  13. Nor were height and weight—contrary to some public opinion
  14. Nor was speed
  15. However, while strength was not a statistical factor overall, an analysis of strength by criminal charge shows a scary pattern: those accused of Sexual Assault scored the lowest on the NFL Combine Bench Press strength test (a quick sketch of this comparison follows the list).
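That last comparison is not complicated. Here is a hedged sketch, continuing from the merged table in the earlier snippet (the "bench_press_reps" and "charge_category" column names are hypothetical):

```python
# Strength by criminal charge: group the merged table by charge category and
# compare bench press reps. Column names are hypothetical.
bench_by_charge = (merged
                   .dropna(subset=["bench_press_reps"])
                   .groupby("charge_category")["bench_press_reps"]
                   .agg(["median", "mean", "count"])
                   .sort_values("median"))

print(bench_by_charge.head())   # charges whose arrestees posted the lowest rep counts
```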

The rest of this post outlines the details of these findings, along with a range of charts and interactive visualizations highlighting data patterns and trends.

The data

  • 730 arrests between 2000 and the present (the database actually expanded by one entry a few days after launch to account for the arrest of Jonathan Dwyer)
  • These 730 arrests spanned 544 players (more on that below). Of these 544 players, 330 had publicly-available NFL Combine results
  • The arrests spanned 51 separate criminal charges (with some interesting concentrations, see below)
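As a quick sanity check on those player counts (and the repeat arrests in finding 3), a few lines against the arrests table loaded earlier are enough; as before, the column name is hypothetical:

```python
# How many distinct players, how many repeat offenders, and the maximum
# number of arrests for a single player.
arrests_per_player = arrests["player_name"].value_counts()

print(len(arrests_per_player))         # distinct players (544 in the database)
print((arrests_per_player > 1).sum())  # players arrested more than once (124)
print(arrests_per_player.max())        # most arrests by a single player (9)
```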

Here is what I found. (See the bottom of this post for notes on methodology and data sources.)

   Next: Arrest Frequency (and who was arrested nine times)

On the 45th Anniversary of the Moon Landing: 5 Lessons the Apollo Program Manager Taught Me at MIT

I originally posted a version of this five years ago, on the 40th Anniversary of the Apollo Moon landing. At that time, social media and smartphones were just starting to explode. Today, as social sharing and mobile are giving rise to the IoT, these lessons from 1969 are perhaps even more important.

Putting things in perspective

It is easy to feel really proud of our accomplishments, whether we are scaling a consumer application 1,000-fold in one year, rolling out a huge ERP program or even creating a new technology. However, these accomplishments pale in comparison to what the Apollo, Gemini and Mercury missions achieved 45 years ago. Imagine this scenario:

You are listening to the radio and the President announces that the country is going to put a man on the Moon by the end of the decade. Keep in mind that no one has ever even escaped low Earth orbit–let alone escaped Earth’s gravity, executed Hohmann transfers AND navigated to another body. Now you have to implement the largest engineering project in history, while inventing not only technologies, but also whole fields of study. All under the watch of the press—and all completed within one decade.

This is inconceivable to most of us in our work today. It is inspirational.

Success: One small step for man, one giant leap for mankind. (Credit: NASA)

My lucky exposure to the people of Apollo

When I studied aerospace engineering at MIT, we were lucky enough to have several veterans of the Apollo Program on staff as our instructors. Not only were they great instructors; they could also recount first-hand experiences of events that the rest of us could only read about in the history books.

One of these professors was Joe Shea, the original Program Manager of NASA’s Apollo Program (portrayed by Kevin Pollak in HBO’s excellent series, “From the Earth to the Moon”). Contrary to what that series depicted, it was Joe who came up with the concept of splitting the Apollo Program into missions that each achieved never-before-achieved technological marvels.

Joe is also considered by some to be a founder of the Systems Engineering profession (many consider him the greatest systems engineer who ever lived). This made him the perfect person to teach the capstone class of the aerospace curriculum: Systems Engineering. (Fred Wilson of USV has written a great post on how fun Systems Engineering is and how important it is for engineering leadership.) Every year, he would get a project from NASA and guide his students through all aspects of design, simulation, planning and even cost analysis. Our midterms and finals were real-life presentations to the Administrator of NASA.

Under Joe, I got to work on something called “Project Phoenix”: returning to the Moon—but now with a re-usable capsule, landing four astronauts at the pole and keeping them there for 30 days (a much harder prospect). On this project I learned about everything from active risk management to critical path costing to lifting bodies to Class-E solar flares. (How cool was that for a 20-year-old?)

Life lessons I learned from Joe

The technical things I learned from Joe got me my first job at Lockheed Martin (then GE Aerospace). It was great to be able to say that I had worked on a NASA program, helped create both a PDR (Preliminary Design Review) and CDR (Critical Design Review) and present elements of them to the Administrator of NASA in Washington.

However, I learned five much more important lessons – independent of aerospace or any other technology – that I have used in the twenty-three years since:

  1. Break Big Challenges into Small Parts. Any obstacle can be overcome if you break it down into smaller items. If these are still too large, break them down again. Eventually you will get to things that have clear, straightforward paths to success. Essentially this is the engineer’s version of “a journey of a thousand miles begins with a single step.”
  2. Know Your Stuff Inside and Out. You cannot be a technology leader who only manages from above. You must understand how the components work. This is the only way you will see problems before they happen. Remember, you are the leader who is the only one positioned to connect the “Big Picture” to the execution details.
  3. S#!% Happens. Things break. Schedules are late. People leave the project. Plan for this. Ask yourself every week what can go wrong. Put contingency plans together to address the biggest or most likely of these. Today, this is done in everything from Risk Management to DevOps.
  4. There is No Such Thing as Partial Credit. Yes, unlike a rocket, you can “back out” (essentially un-launch) software. However, the costs of this type of failure are enormous: not only does it cost 3-5x more to back out, fix and regression test changes, it also frequently results in lost revenue and customers. Get things right in development – then certify them in testing (not the other way around). Don’t count on being able to “back out” after a failed launch–this will become more and more true as we push software to millions of “things” comprising the IoT. Joe hammered this lesson into our heads with a chilling story: when people forgot it and rushed, three astronauts died during a basic systems test on Apollo 1.
  5. Take Ownership. If you are the leader, you are responsible for the team’s or product’s success. If you are a line manager, you are not only responsible for your area but are being relied upon by your peers for success. If you are a hands-on analyst or engineer you are actually delivering the work that leads to success. In all cases, ensure you do your job right, ask for help when you need it and never lie or hide anything.

Five really important lessons. I am grateful I had the opportunity to learn them before I entered the full-time workforce. I try to “pay this back” by teaching these lessons and concepts everywhere I go.

Before I forget…

Thank you to the men and women of Apollo. Thank you also to the men and women of Gemini and Mercury (it is easy to forget them on this day). You achieved miracles on a daily basis and inspired whole generations of scientists and engineers.


Twitter traffic jams in Washington, created by… John Oliver

Note: This post was first published as Twitter Sensors: Detecting the Traffic Jam in Washington Caused by… John Oliver on the Savi Technology Blog.

Summary: In the first week of June, 20% of the Tweets about traffic, delays and congestion by people around the Washington Beltway were caused by John Oliver’s “Last Week Tonight” segment about Net Neutrality.

At work, we are always exploring a wide range of sensors to obtain useful insights that can be used to make work and routine activities faster, more efficient and less risky. One of our Alpha Tests examines the use of “arrays” of highly targeted Twitter sensors to detect early indications of traffic congestion, accidents and other sources of delays. Specifically, we are training our system to use Twitter as a traffic sensor and determining whether it is a good one (by “good,” in data science speak, we mean a model for traffic detection with a good balance of precision and recall, and hence a good F1 score). To do this, I set up a test bed around the nation’s second-worst commuter corridor: the Washington DC Beltway (my own backyard).
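For readers who don’t speak data science: precision is the share of flagged events that are real, recall is the share of real events that get flagged, and F1 is their harmonic mean. A tiny illustration, with made-up counts:

```python
# Standard F1 definition, shown only to make the "good sensor" criterion
# concrete; the counts in the example below are invented.
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# A sensor that flags 80 real congestion events, raises 20 false alarms,
# and misses 40 real events:
print(round(f1_score(80, 20, 40), 2))   # 0.73
```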

Earlier this month, our array of geographic Twitter sensors picked up an interesting surge in highly localized tweets about traffic-related congestion and delays. This was not the kind of surge you expect from a bad commuting day. The number of topic- and geographically-related tweets seen on June 4th was more than double the expected number for a Tuesday in June around the Beltway; the number seen during lunchtime was almost 5x normal.
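Conceptually, the surge check is just a comparison of observed counts against a baseline for that weekday and hour. The sketch below is far simpler than the production model, and the numbers and threshold are invented for illustration:

```python
# Invented numbers and threshold; the real model learns its own baselines.
def is_surge(observed: int, expected: float, threshold: float = 2.0) -> bool:
    """Flag an hour whose tweet count is at least `threshold` times its baseline."""
    return expected > 0 and observed / expected >= threshold

print(is_surge(290, 60))   # True: a lunchtime hour at almost 5x its baseline
```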

So what was the cause? Before answering, it is worth taking a step back.

The folks at Twitter have done a wonderful job: not only can you fetch tweets based on topics, hashtags and geographies, they have also added some great machine learning-driven processing to screen out likely spammers and suspect accounts. Nevertheless, Twitter data, like all sensor data, is messy. It is common to see tweets with words spelled wrong, words used out of context, or simply nonsensical tweets. In addition, people frequently repeat the same tweets throughout the day (a tactic to raise social media exposure) and do lots of other things that you must train the machine to account for.
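Some of that clean-up is mundane string work. Here is a hedged sketch of the kind of normalization and de-duplication involved; the tweet schema (a dict with "user", "date" and "text" keys) is hypothetical, not the Twitter API’s:

```python
# Illustrative clean-up of raw tweet text before it reaches the model:
# normalize the text, then drop near-verbatim repeats from the same account.
import re

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)     # strip links
    text = re.sub(r"[^a-z0-9#@ ]+", " ", text)   # strip punctuation and emoji
    return re.sub(r"\s+", " ", text).strip()

def dedupe(tweets):
    """tweets: iterable of dicts with 'user', 'date', 'text' keys (hypothetical schema)."""
    seen, kept = set(), []
    for t in tweets:
        key = (t["user"], t["date"], normalize(t["text"]))
        if key not in seen:          # keep only the first copy of a repeated tweet
            seen.add(key)
            kept.append(t)
    return kept
```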

That’s why we use a Lambda Architecture to process our streaming sensor data (I’ll write about why everyone, from marketers to DevOps staff, should be excited about Lambda Architectures in a future post). As such, not only do we use Complex Event Processing (via Apache Storm) to detect patterns as they happen; we also keep a permanent copy of all raw data that we can explore to discover new patterns and improve our machine learning models.
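For the curious, here is a toy illustration of those two paths. The real speed layer is a Storm topology, not a Python function, and the field names below are assumptions; the point is only the shape of the architecture: append every raw event for later replay, and keep a cheap incremental view for right now.

```python
# Toy Lambda sketch: a batch layer (immutable raw store) plus a speed layer
# (incremental real-time view). Field names and file paths are hypothetical.
import json
from collections import Counter

raw_log = open("raw_tweets.jsonl", "a")      # batch layer: append-only raw store
realtime_counts = Counter()                  # speed layer: cheap incremental view

def handle_tweet(tweet: dict) -> None:
    raw_log.write(json.dumps(tweet) + "\n")  # always keep the raw event
    hour = tweet["created_at"][:13]          # assumes ISO timestamps, e.g. "2014-06-04T12"
    realtime_counts[hour] += 1               # approximate but immediate

def rebuild_view(path: str = "raw_tweets.jsonl") -> Counter:
    # Batch recomputation: replay the raw store with improved logic whenever
    # the model changes (this is how newly discovered patterns get folded in).
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line)["created_at"][:13]] += 1
    return counts
```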

That is exactly what we did as soon as we detected the surge. Here is what we found: the cause of the traffic- and congestion-related Twitter surge around the Beltway was… John Oliver:

  1. In the back half of June 1st’s episode of “Last Week Tonight” (HBO, 11pm ET), John Oliver had an interesting 13-minute segment on Net Neutrality. In this segment he encouraged people to visit the FCC website and comment on this topic.
  2. Seventeen hours later, the FCC tweeted that “[they were] experiencing technical difficulties with [their] comment system due to heavy traffic.” They tweeted a similar message 74 minutes later.
  3. This triggered a wave of re-tweets and comments about the outage in many places. Interestingly, this wave was delayed around the Beltway: it surged the next day, just before lunchtime in DC, and continued throughout the afternoon. The two spikes were at lunchtime and just after work. Evidently, people are not re-tweeting while working. The timing of the spikes also reveals some interesting behavior patterns in Twitter use in DC.
  4. By 4am on Wednesday the surge was over. People around the Beltway were back to their normal tweeting about traffic, construction, delays, lights, outages and other items confounding their commute.

Of course, as soon as we saw the new pattern, we adjusted our model to account for it. However, we thought it would be interesting to show in a simple graph how much “traffic on traffic, delays and congestion” Mr. Oliver induced in the geography around the Beltway over a 36-hour period. Over the first week of June, one out of every five Tweets about traffic, delays and congestion by people around the Beltway was not about commuter traffic, but instead about FCC website traffic caused by John Oliver:

Tweets from people geographically Tweeting around the Washington Beltway on traffic, congestion, delays and related frustration for the first week of June.

Obviously, a simple count of tweets is a gross measure. To really use Twitter as a sensor, one needs to factor in many other variables: the use of text vs. hashtags; tweets vs. mentions and re-tweets; the software client used to send the tweet (e.g., HootSuite is less likely to be a good source of accurate commuter traffic data); the number of followers the tweeter has (not a simple linear weighting); and much more. However, the simple count is a useful first-order visualization. It also makes for interesting “water-cooler conversation.”
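To make that concrete, here is a purely illustrative weighting function. The factor names mirror the list above, but the tweet fields and the numbers are invented; in practice, the weights come out of model training rather than hand-tuning:

```python
# Invented, illustrative weights for the factors listed above; the tweet
# dict fields ("is_retweet", "client", "followers", "text") are hypothetical.
import math

def traffic_signal_weight(tweet: dict) -> float:
    weight = 1.0
    if tweet.get("is_retweet"):
        weight *= 0.5                          # retweets echo; they don't observe
    if tweet.get("client") in {"HootSuite"}:   # scheduled-posting clients
        weight *= 0.2
    if "#traffic" in tweet.get("text", "").lower():
        weight *= 1.3                          # explicit hashtag beats a loose keyword match
    followers = tweet.get("followers", 0)
    weight *= 1 + 0.1 * math.log1p(followers)  # deliberately sub-linear in follower count
    return weight
```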

5 points where tech balances between life and business