A quick dive into data transformation and migration with Petl
13 Mar 2019Before we dive into data transformation and migration, it is useful to know what Petl is and how to do a quick installation. Petl can be identified as a Python package for extracting, transforming and loading data. Please follow the links below for official documentation and source code for further knowledge. - Documentation: http://petl.readthedocs.org/ - Source Code: https://github.com/petl-developers/petl - Download: http://pypi.python.org/pypi/petl #### Why? There was several reasons behind selecting Petl as my data manipulation tool for the data migration. - Firstly, I had to migrate data from an old MySQL database to a PostgreSQL database. - Secondly, the data had to be manipulated and changed due to new business requirements in the new database. - Thirdly, due to the fact that the number of records were enormous and it had to be a lightweight efficient package. So Petl ticked all the boxes to be selected in my scenario. This could be different in your case and there might be better solutions than Petl. #### How? You can use Petl in any python based project, however in my scenario it was considered best to create a separate project using Django for scalability and the independence of the migration. Once you quickly spin up a Django project, first you have to build up your database connections to both source and your target. You can use ConfigParser to build up the connections which then can be used at any given point to retrieve or write. ``` ## Settings.ini #Source source=provider=MySQLdb,host=54.236.xxx.xx,user=root,password=xx,db=source_database_name,charset=utf0 #Target target=provider=psycopg2,host=54.236.xxx.xx,user=root,password=xx,db=target_database_name ``` ``` #### setting up connections import ConfigParser from django.conf import settings parser = ConfigParser,SafeConfigParser() # Read configuration from settings.ini parser.read('settings.ini') db_connections = build_connections(parser.items('connections') ``` ``` #### build connections connections = {} for item in settings: name = item[0] values = item[1].split(',') provider = values[0].split('=')[1] params = values[1:] con_params = {} for param in params: parts = param.split('=') con_params[parts[[0]] = parts[1] library = __import__(provider) try: connection = library.connect(**con_params) connections[name] = connection except library.OperationalError as e: logging.error("Unable to open connection %s: %s", name, e) ``` Once the connections settings are up and you are connecting to the source database you can begin extracting data. #### Petl functionality You can use petl.io.db.fromdb(dbo, query, *args, **kwargs) to extract a table from a file-like source or a database. Since we have specified our connection earlier you can use it as below. ``` #### extracting a table import petl as etl table = etl.fromdb(source, 'SELECT * FROM users') ``` After retrieving the table/data you can use Petl to manipulate data quite easily. You can use addfield() to simply add any type of new column. If you leave the value empty it will create empty column which then you can fill later. ``` #### addfield() table = table.addfield('last_modified', lambda rec: datetime.now()) table = table.addfield('created', lambda rec: datetime.now()) ``` You can on the fly convert values of a column using convert(table, *args, **kwargs). For example you can use this to update user passwords to more secure passwords using SHA1PasswordHasher ``` #### SHA1PasswordHasher from django.contrib.auth.hashers import SHA1PasswordHasher def hash_password(password) hasher = SHA1PasswordHasher() encoded = hasher.encode(password, hasher.salt()) return encoded table = table.convert('password', hash_password) ``` Another handy function which is most useful is the join(). You can use join two database tables as soon as they are extracted using a common key. This function also supports compound keys which can be used to join the table. For an example I want to join two user table into one to remove data redundancy. Please refer to the image below. This is a simple left join on the common key “id”. ``` #### joins() query = """SELECT user_id as id, company_name FROM corporate_user_info""" second_query = """SELECT user_id as id, full_name FROM users_personal_info""" source = fromdb(source, query) source2 = fromdb(source, second_query) final_source = join(source, source2, key=['id']) ``` In case you forgot to add some data to a table you can appenddb() to just bulk append a whole new data set to existing table. Once you are done with your data manipulation it’s time to migrate the data to the new database. Petl makes it easier with function called todb(). ``` #### todb() source.todb(target, 'new_user_table') ``` Yes, it is easy like that. I have only written simple SQL queries for the code above, however If you wish to manipulate data in the SQL Petl also allows you write advance SQL with the specified SQL adapter. The toughest part using the Petl would be to set up the database connections I would say. Once the connections are up and ready to go, it would not be hard to manipulate the data however the way you want. Please do refer to the documentation for more advance manipulation. - Documentation: http://petl.readthedocs.org/