Book Summary: Tidy First? by Kent Beck
Kent Beck’s Tidy First? is a concise and engaging read, outlining several “tidyings” (small code improvements) that make software easier to understand and more adaptable to future changes. He emphasizes that “software design enables change” and that even small design improvements can make later modifications smoother.
Key tidyings discussed in Part I include:
Guard Clauses: Within routines, move “guards” or “preconditions” to the top and return early. Avoid too many guard clauses within a routine.
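As a rough illustration of the guard clause tidying, here is a minimal sketch. The function names, the order shape, and the pricing numbers are hypothetical and are not taken from the book; the point is only that the preconditions move to the top and return early, leaving the main logic unindented.

```python
def shipping_cost_nested(order):
    # Before tidying: the main logic is buried inside nested conditionals.
    if order is not None:
        if order.get("items"):
            weight = sum(item["weight_kg"] for item in order["items"])
            return round(5.0 + 1.5 * weight, 2)
    return 0.0


def shipping_cost(order):
    # After tidying: guard clauses handle the preconditions up front.
    if order is None:
        return 0.0
    if not order.get("items"):
        return 0.0

    # Main logic reads straight through once the guards have passed.
    weight = sum(item["weight_kg"] for item in order["items"])
    return round(5.0 + 1.5 * weight, 2)
```

Both versions behave the same way; the second is simply easier to scan. Two guards are fine here, but a long run of them is usually a hint that the routine is doing too much.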
Streaming Reads with Python and Google Cloud Storage
In data processing, efficiency and reliability are paramount. As a data engineer, you’ll often need to read files in resource-constrained environments. One common approach to reading a file is to stream the file and process it in smaller chunks. I recently came across a way to accomplish this using Google Cloud Storage (GCS), Python, and a CRC32C checksum (to verify the file’s integrity). Some reasons why this approach could be useful and why this post exists:
Keyset pagination in PostgreSQL like a pro
Why keyset pagination
With infinite scrolling tables on websites, keyset pagination is a technique that provides approximately constant-time access to subsequent pages as a user scrolls. This approach can be implemented with data stored in a relational database like PostgreSQL. It is more complex to implement than a simple approach like LIMIT + OFFSET pagination, but it avoids the slowdown that LIMIT + OFFSET shows as you scroll many pages into a result set. The database doesn’t have to load the entire result set, sort it, and then return the specified limit from the given offset.
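The idea can be sketched in a few lines. This is a minimal sketch, not the post's implementation: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so it runs anywhere, and the posts table, its columns, and the fetch_page helper are all hypothetical. Each page is fetched by filtering on the last key seen rather than by skipping an offset.

```python
import sqlite3


def fetch_page(conn, after_id=0, page_size=3):
    """Return one page of rows plus the cursor to use for the next page."""
    rows = conn.execute(
        # Filter on the last key seen instead of using OFFSET, so an index
        # on id can seek straight to the start of the page.
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # The last id on this page becomes the cursor for the next request.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(1, 8)],
)

page1, cursor1 = fetch_page(conn)
page2, cursor2 = fetch_page(conn, after_id=cursor1)
page3, cursor3 = fetch_page(conn, after_id=cursor2)
```

The same query shape should carry over to PostgreSQL with a driver such as psycopg, where the placeholder is %s rather than ?. Sorting on a non-unique column would need a tie-breaker column in both the ORDER BY and the filter, which is where most of the extra complexity comes from.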
BigQuery: An interactive analytics benchmark
If you have operational data sitting in BigQuery that powers dashboards through tools like Tableau, Looker, or Apache Superset, putting an exploratory analytics tool on top of your BigQuery datasets lets business and technical users explore the data interactively, and performance is surprisingly good. An automated test suite ran over a standard dataset at several sizes, simulating “slice and dice” queries from concurrent users, and measured BigQuery’s performance.
Snowflake supports interactive analytics at scale
With a proliferation of massively parallel processing (MPP) database technologies, like Apache Pinot, Apache Druid, and ClickHouse, there is no shortage of blog posts on the Internet explaining how these technologies are the only ones capable of supporting interactive analytics on large data volumes. That is not the case. Benchmark tests on Snowflake’s platform with wide, denormalized datasets and concurrent query access patterns show that Snowflake offers reasonably fast query performance on large datasets when queried in an iterative, ad-hoc fashion.