Converting an integer id field to Foreign Key field without data migration

Imagine a scenario, where there is a legacy model with a integer id field, however it is not a foreign key.

Below is an very simple example:

Of course it would be much better if the model is to be changed to this, where the car_id IntegerField becomes a  car  ForengnKey:

If we run makemigrations the files generated will look something like this:

If the wheel model/table contains data, then Django migration will ask for a default value to be specified. One way to get around this is to handcraft a migration, so a data migration will not be necessary:

The trick here is that we ‘pin’ the underlying column name with db_column before renaming the field.

I hope this will be useful!

Some Thoughts on Aggregating Moderate Amount of Data

As part of my role, I need to perform aggregates (group by operation) on some medium size raw text file data, in the region of 100s to 200s of Gb, as a one-off exercise.

I have tried using R is performing such aggregates, however, it is slow for a relative small size file. Also, there is memory concern as the biggest data file will not fit in memory.
Therefore I hesitate to carry on with this approach.

I have also tried Mysql. Mysql will perform a sort when doing a group by (or use suitable
index if available), which is a cause of slowness with our data size. Table partitioning helps, but there is limit on how many partitions are allowed on a mysql table.

I was interested in Apache Hadoop Hive initially, as it is SQL-like, can handle big data, and have partition table as a feature. However, by trial and error, I have found that without a partition table, it is already very fast using group by directly. It is perhaps to do with the MapReduce paradigm, and its implicit hashing, which is akin to partitioning. (It will be good if there are some feedback on the above ‘analysis’. )

At the end, I moved on from Hive to Twitter Scalding. If the data is homogeneous and uniform,
Hive will be a good tool to complete the task, ie, put all files in an external tables, use a suitable plugin called Serde, and just run a group by SQL.

However, the data files I am working on are not exactly of the same format. To perform it in Hive will require customization with extensions written in Java. On the other hand, Scalding is a library for Scala by Twitter, a general programming language, and I feel is more suited for this task.

Getting File Name from Textfiles in Scalding

As I mentioned before, I need to perform aggregates on some medium size raw text file data. To speed up the process, I experimented with Twitter Scalding, a big data Tool derived from Hadoop family, targeted for programming language Scala.

Apart from the setup, the actual code I need to write is not really rocket science (precisely the reason I have chosen it). However, for trouble shooting, I do want to keep the original file name as one of the data points.

One way to do this is open each file using TextLine, create many Pipes, and merge those pipes at the end. However, opening multiple Files with TextLine and merge afterwards become very slow comparing to just using MultipleTextLineFiles, a class specialised in opening multiple files.

I end up extending MultipleTextLineFiles from scalding, altering it such that it can emit file names as well. If this is of use to anyone, my implementation can be found in Gist. (I have tried to keep everything in Scala, but I encounter some difficulties in extending one of the class from Cascading, and unfortunately for one class I have to revert using Java instead.)

Caveat of Partitioning with Timestamp on Mysql

On a project I am working on, I have a fair amount of data to be processed. I wanted to load the data into a huge table so I can analyze the data quality better. While the data set is not Petabytes in size, it is not small either. And as I needed some index on the table, and to alleviate the growth of average insertion speed which is of \(O(n log(n))\), partitioning comes in mind.

I think the most road traveled is to use TO_DAYS(date) on a Datetime field. However, if for whatever reason, you wanted to use timestamp instead of datetime, one need to be careful how the timestamp field is declared, otherwise it will be defaulted to use CURRENT_TIMESTAMP on insertion and update. I had to default it to NULL to override the default timestamp behaviour(and hence have a unfortunately side effect of allowing nulls).

Below is an example.

Heuristic Views on Random Topics