Surfing the Data Pipeline with Python

Jonathan Kropko
School of Data Science
University of Virginia

Version 0.2.4

The data pipeline refers to all of the steps needed to go from raw, messy, original data to data that is ready for any kind of analysis. In the real world, data is almost never ready to be analyzed without a great deal of preparation first. The goal of this book is to make this huge part of the job easier, faster, less frustrating, and more enjoyable for you. The techniques we will discuss are not the only ways to accomplish a task, but they represent fast and straightforward ways to do the work.

The book was written entirely without the assistance of AI. All errors are attributable entirely to the flawed human author.

Get Started

1. Getting Yourself Unstuck

Get Data

Wrangle Data

Explore and Communicate Data

Appendices

Virtual Environments and Containers

This website is free to use, and is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.

If you catch any typos or mistakes, you can let me know by submitting an issue on this book’s GitHub repository. You can also make changes to the text yourself by opening a pull request. If you do, I’ll make sure you get thanked on this page.

This book is dedicated to Cypress.

My sincere thanks to Cedric Harper, Nick Clifford, Brian Wright, Yash Tekriwal, Sucheta Soundarajan, Pete Alonzi, Raf Alvarado, Nada Basit, Youssef Abubaker, David Xu, and Mike Powers for help pulling material together and comments on the book.

The book’s icon is “Surfing Dunedin” by Dunedin NZ, and is licensed under CC BY-ND 2.0.