The amount of available data roughly doubles every two years, mimicking Moore's Law; we now measure it on the scale of petabytes and zettabytes. As a result, big data, and data in general, has become a trend in business and everyday life. This is because data is an underutilized resource: everyone has it, but few are capable of capitalizing on it.
To fully realize the potential of data, it must tell a good story. Humans understand and process information best in narrative form, which is why storytelling has been around as long as humans have. When working with data, this can seem difficult as few people associate spreadsheets with a good story. Hopefully after this tech briefing that will change.
As part of my senior thesis I am integrating data-driven storytelling into college media by doing data journalism for the Daily Wildcat. This combines two of my passions and academic activities: working with data (MIS) and story (journalism). Traditional journalism focuses on human sources, while data journalism uses data as the primary source and humans as secondary sources. Below are links to some of the stories I've published this year:
FOIA-ing for police reports to find where parties are taking place around campus by creating a red tags heat map
7 Data Visualizations that explain the Sean Miller Era of UA basketball
Analyzing 22,000 on-campus parking citations to find the best place to park illegally
Fact checking the proponents and opponents of Prop 205 (Recreational Marijuana) by digging into the data
Pac-12 Undergrad Ethnicities Viz
Telling a story with data requires going through a process that mimics data science:
- Ask an interesting question
- Get the data
- Explore the data
- Communicate and visualize the findings
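The four steps above can be sketched as a minimal pandas workflow. The citation data here is made up for illustration; a real story would load a CSV from a records request:

```python
import pandas as pd

# 1. Ask an interesting question:
#    "Which campus lot issues the most parking citations?"

# 2. Get the data (tiny inline sample standing in for a real dataset)
citations = pd.DataFrame({
    "lot": ["A", "B", "A", "C", "A", "B"],
    "fine": [45, 60, 45, 25, 45, 60],
})

# 3. Explore the data: count citations per lot
counts = citations.groupby("lot").size().sort_values(ascending=False)

# 4. Communicate the finding
print(f"Most-cited lot: {counts.idxmax()} ({counts.max()} citations)")
```

The same four-step shape holds whether the dataset is six rows or 22,000.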
Communicating Your First Data-Driven Story:
- Find your subject
  - Pick a topic that has the potential for a good story
  - Ask an interesting question that people want to know the answer to
- Find your data
  - Define your terms and find the associated data
    - E.g. How to get away with parking on campus → parking citations
    - Partying at UA → red tags
    - Sean Miller is a good coach, but what makes him elite? → KenPom, team stats vs. D1 average
  - Living in the age of data means an abundance of publicly available data. FOIA-ing public institutions is also an option
- Clean your data
  - Get your data into a format that can be analyzed and related back to your story's subject
  - Normalize the data
  - Pareto Principle: 80% of the time will be spent preparing the data and 20% analyzing and visualizing it
- Find the answer to your question
- Present your data
  - VISUALIZE
    - Humans are visual creatures, and this helps transform raw data into intuitive information
  - Check your story for the following:
    - Audience
    - Lede
    - Nut graf
    - Reference point
    - New understanding from data
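The cleaning step usually dominates that 80%. A minimal sketch of what it looks like in pandas, with made-up citation data (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw data: inconsistent casing/whitespace, numbers stored as text
raw = pd.DataFrame({
    "location": [" Main Gate ", "main gate", "Cherry Garage", "Cherry Garage"],
    "fine": ["45", "45", "60", "60"],
})

clean = raw.copy()
clean["location"] = clean["location"].str.strip().str.title()  # fix whitespace and casing
clean["fine"] = pd.to_numeric(clean["fine"])                   # strings -> numbers
clean = clean.drop_duplicates()                                # remove exact repeats

print(clean)
```

Note that "Main Gate" and " main gate " only collapse into one row because the text was standardized first; order of cleaning steps matters.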
Following this process will ensure you communicate a data-driven story effectively. If you're interested in learning more about data journalism, here is some more information to check out:
Sites to Follow:
- 538 - The standard of data journalism at the moment
- /r/dataisbeautiful - Lots of interesting content from around the web aggregated here
- The Upshot - The NYT data section
- QZ - International site that does high quality work
- Flowing Data - An aggregator of different data stories
- The Economist - Usually has good visualizations that aid reader understanding
- Polygraph - Really in-depth data stories on pop culture
Stories/Visualizations:
Tools to Use:
- Google Sheets or Excel
  - This is where the majority of your time working with data will likely be spent, so learn to love it
  - Text to Columns
  - Remove Duplicates
  - VLookup
  - Pivot Tables
  - Index + Match
- An accessible data visualization tool, and a skill many companies are looking for
- This is what I've used for more detailed mapping
- R Programming
  - Useful statistical programming language. If you're very interested in this stuff, R is worth your while
- Python
  - Python is the second best option for everything
  - Pandas, NumPy, Matplotlib, ggplot, scikit-learn
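If you outgrow spreadsheets, the two workhorse operations above — VLookup and Pivot Tables — have direct pandas equivalents. A small sketch with invented data (the column names are placeholders):

```python
import pandas as pd

sales = pd.DataFrame({
    "rep": ["Ann", "Bob", "Ann", "Bob"],
    "region": ["West", "East", "East", "West"],
    "amount": [100, 200, 150, 50],
})
reps = pd.DataFrame({"rep": ["Ann", "Bob"], "team": ["Red", "Blue"]})

# VLookup equivalent: merge on a shared key column
joined = sales.merge(reps, on="rep")

# Pivot Table equivalent: rows, columns, values, and an aggregation
pivot = joined.pivot_table(index="team", columns="region",
                           values="amount", aggfunc="sum")
print(pivot)
```

Unlike VLookup, `merge` matches every row at once and never silently returns the wrong-column offset, which is a common spreadsheet bug.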