The Cripplingly High Price of Modern Digital Experimentation

Back in the Long Boom the doors of the elevators at Yahoo! were covered in graphics showing a contemporary version of the home page code. That made a big impression on me when I visited: I had been experimenting with PHP, the same language Yahoo! was written in, and as I stared at those elevator doors, waiting for them to open, I realized that I could read and understand the code of what was then one of the most powerful web properties in the world.

Things were a lot simpler then, and we were too.

The setup back then was called the 'LAMP' stack. You took a (L)inux computer, installed on it an (A)pache web server and a (M)ySQL database, and wrote code for it, by hand, without frameworks or bullshit, in (P)HP or (P)erl. This was the era before cloud, when we all had servers in our basements. Once you had the server set up, and a connection to the internet (which back then was still admittedly not a trivial thing), prototyping was essentially free, limited only by your available time and imagination.

Look at us now. How different it is.

I recently thought I would use a small experiment in data visualization to learn a bit about big data. I like ETL and making datasets and gleaning insights from them, but my work has often been elsewhere, so I am less experienced in working with very large 'Big Data' datasets than I would like. A period of unemployment gives me some time, and a recent certification in machine learning some confidence, so I thought I would take the opportunity to experiment a little, and learn the best way possible, which is by doing.

My lack of practical understanding in this area is now, in hindsight, very obvious, because it has made this an unexpectedly expensive experience for me.

The data itself, as always, wants to be as close to free as it can be. A historical 20-year dataset of more than 750M rows cost me $350, plus $19 for daily updates. I broke out PySpark in an AWS Glue notebook and started writing transformation code - the middle letter of ETL is my coding happy place - and it felt a lot like when I was experimenting on the open web, until I got the first monthly bill and realized how quickly things add up when you are being charged by the hour for the development environment.
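
For a sense of what that looked like, here is a minimal sketch of the kind of transformation I was writing in the Glue notebook. The bucket paths and column names are hypothetical stand-ins, not my actual dataset, but the shape is right:

```python
# Minimal PySpark sketch of a Glue-notebook-style daily transformation.
# Bucket paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-transform").getOrCreate()

# Read one day of raw CSV data from S3.
raw = spark.read.csv("s3://my-bucket/raw/2024-01-02/", header=True, inferSchema=True)

# Typical cleanup: normalize types, drop obvious junk, derive a partition column.
clean = (
    raw.withColumn("event_ts", F.to_timestamp("event_ts"))
       .withColumn("price", F.col("price").cast("double"))
       .dropna(subset=["event_ts", "price"])
       .withColumn("event_date", F.to_date("event_ts"))
)

# Write back to S3 as partitioned Parquet, ready to catalog in Glue and query in Athena.
clean.write.mode("overwrite").partitionBy("event_date").parquet("s3://my-bucket/curated/")
```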

So I backed off into Python locally, got things running on my laptop, and started running an ingest experiment. This was when I realized the transformations I was performing were going to take a very, very, very long time to complete. That sinking feeling when you realize that what you are doing needs to be vastly improved is the feeling of learning happening to you.
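
The laptop version was the naive, single-threaded shape of the same work, roughly the sketch below (file layout and column names again hypothetical). It is exactly the kind of loop that turns hundreds of millions of rows into a multi-week job:

```python
# Naive local ingest sketch: loop over daily files with pandas, one core, all in memory.
# File layout and column names are hypothetical placeholders.
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob("data/raw/*.csv")):
    df = pd.read_csv(path, parse_dates=["event_ts"])
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df = df.dropna(subset=["event_ts", "price"])
    frames.append(df)

# Concatenating hundreds of millions of rows in memory is where the laptop gives up.
combined = pd.concat(frames, ignore_index=True)
combined.to_parquet("data/curated/all_days.parquet", index=False)
```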

I optimized the code, put it in containers, pushed those to ECR, and then wrote more code to run the containers in parallel in ECS Fargate, monitoring progress in CloudWatch while persisting results to S3 and the Glue catalog. The whole thing accelerated nicely, to the point where I was eager to run the data and get to work on the visualization part of the exercise.
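
The fan-out itself is not much code. A sketch of the kind of boto3 launcher I mean looks something like this; the cluster, task definition, subnet, and container names are made-up placeholders:

```python
# Fan-out sketch: launch one Fargate task per slice of the backlog via boto3.
# Cluster, task definition, subnet, and container names are hypothetical placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

date_ranges = [("2004-01-01", "2004-06-30"), ("2004-07-01", "2004-12-31")]  # and so on

for start, end in date_ranges:
    ecs.run_task(
        cluster="bigdata-experiments",
        taskDefinition="ingest-transform:3",
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": "ingest-transform",
                    # Each task reads these to pick its slice of the backlog.
                    "environment": [
                        {"name": "START_DATE", "value": start},
                        {"name": "END_DATE", "value": end},
                    ],
                }
            ]
        },
    )
```

Each container picks up its date range from the environment, writes its output to S3, and logs to CloudWatch, so the parallelism is as wide as the task quota, and the wallet, will allow.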

That was when I realized that my running cloud cost for processing the millions of rows in each day of data is in the $2-$3 range: nominal enough on an ongoing basis, but I have a 20-year backlog of data, and the cost of processing all of it would be about $15k. In Fortune 100 business terms this is minuscule. In personal terms it is simply unapproachable.
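
The back-of-the-envelope arithmetic behind that estimate is simple enough to be painful:

```python
# Back-of-the-envelope cost math for the 20-year backlog.
cost_per_day_low, cost_per_day_high = 2.0, 3.0    # observed cost per day of data, in dollars
days_in_backlog = 20 * 365                        # roughly 7,300 days of history

print(f"Full backlog: ${cost_per_day_low * days_in_backlog:,.0f} "
      f"to ${cost_per_day_high * days_in_backlog:,.0f}")    # ~$14,600 to ~$21,900

# One 6-month batch, for comparison:
batch_days = 183
print(f"One batch: ${cost_per_day_low * batch_days:,.0f} "
      f"to ${cost_per_day_high * batch_days:,.0f}")          # ~$366 to ~$549
```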

I was running the backlog data in 6-month batches: my initial goal was to complete 10% of the processing and use that data for prototyping the visualizations and experimenting with machine learning. After some math, some marital input, and a small bout of depression, I accepted that I could barely (not really) afford to complete the processing of the first 6-month batch, which is happening now as an implementation of the sunk cost fallacy.

All of this is before I even approach fitting a machine learning model to the resulting data sets, or attempt to use the data in any visualization.

In this first part of my big data learning project I have learned a lot about PySpark and Glue and Python and Docker and ECS and ECR and Athena and a whole lot of other cloud services. Mostly what I have learned is that, even while staying largely within the free tier, private individuals experimenting in Big Data with the intent to innovate are effectively priced out of the current zeitgeist of AI/ML and the data flows necessary to feed them.

This is also important learning.

In the Long Boom growth and optimism were fueled by the sense that anything was possible, and the next big thing was already being built in a garage somewhere nearby, by a person like you. In our present world, merely vast global companies complain that they cannot afford the computing required to create a new AI/ML-based chatbot that gets the facts wrong a majority of the time. Only a low single-digit number of the largest companies can afford to compete in trying to get the facts right more often than not. A person sitting outside a Fortune 100 company, with imagination about how this might pertain to some unaddressed challenge, and a personal bank account fueled by dwindling severance and unemployment payments, is going to have to be very creative, in a severely asymmetric-warfare kind of way, to participate in this phase of the program.

My very small 'Big Data' dataset will finish processing soon, and I am more determined than ever to make the most of the opportunity, however small, it represents. That feeling you get when you realize you have to make the most of what you have, when you have very little, is also the feeling of learning, happening to you.

I'm a full-lifecycle innovation leader with experience in SaaS, ML, Cloud, and more, in both B2B and B2B2C contexts. As you implement your value proposition, I can take a data-driven approach to helping you get it right the first time. If that seems helpful to you, please reach out. I'm #OpenToWork.