The need for ETL work has been steadily increasing as of late. Since I have considerable background in ETL work I thought I'd share some tips through the construction of a really simple ETL.
GoodParseETL is a very simple ETL program that has been designed to illustrate just how quickly a decent ETL can be written but also to show off some techniques I’ve used for improving flexibility and scalability in a very simple ETL program.
First let’s talk about the speed. I started writing this program on May 29th, 2019 at 6 AM and had my first fully functioning extract to an SQL database by 2:45 PM that same day. I commented it and finished for the day. Although the Visual Studio project was one I had created some time ago, it lay empty until May 29th. On May 30th, 2019 I revisited the code to clean it up a bit and make some minor adjustments and then started writing this article. So it took less than 2 days to complete.
Maintainability and flexibility were targets for me so everything is compartmentalized. File operations are in the “FileOperations” class and nowhere else. Database operations are in the “DBOperations” class and nowhere else. Each class is simple and to the point but easily expandable. I’ve even designed the “ParsingOperations” class so you can add additional parsing methods to it effortlessly and it will accept everything from simple parsing methods to entire parsing objects if you wish.
The parsing mechanisms I use focus on the Regex (Regular Expression) classes. Regex expressions utilize a string of characters as the parsing rule to be followed rather than hard coded system calls. The string of characters can define every kind of data parse from simple comma-delimited rows to complex log file extractions like those I had to do for the banks. And since a string of characters is a data element itself, it can be stored in a database which is what this system does.
The ETLManager database is a ridiculously simple SQL database with a single flat table. In it I store ETL definitions. The filename of the file to be read, the Regex string to use on the file to parse the data, database and name of the destination table where the parsed results need to go, all go into the ETLManager record. GoodParseETL is designed as a command line tool and is designed to be provided the ID number of the ETLManager record you wish processed, so it can house literally thousands of parsing definitions that you can call up just by specifying the ETLManager ID.
The data interactions with the code have been designed for performance and compartmentalization. Rather than the code sending a request to store values to the SQL server and then waiting for a response back or worse, having to do some kind of server side data validation, all data interactions are performed via stored procedure. The code doesn’t know about or care what the SQL server is doing. The task of storing the data is delegated to the SQL server through stored procedures where any special data storage needs can be handled.
I do have a very much larger version of this same program that I’ve used for years with a very robust logging mechanism that records any records that didn’t get stored for any reason as well as fallback mechanisms for disaster recovery and multithreading but what I’ve provided you here is the very core of what its bigger brother has done for years. The full blown ETL system has had as many as 1400 different ETL profiles to handle for parsing every web server log throughout the bank I worked at and would regularly process over 500Gb of flat files in a single night.
The real stars of this show are that its data driven and designed for scalability.
The Regular Expression
I want to spend some time on this part because of how truly powerful it is.
The implementation used in this example simply parses a comma delimited file. But the identical lines of code can parse some very extravagant data separation situations. Here I’ve copied the parse code from the GoodParseETL into a test project where I can apply values on the fly. Have a look at some of these results;
I’ve created a line of text in the code above like that which you may find in a text file to be parsed. We want all of the sets of numbers in their original format. In the ParseRule I’ve given a RegEx string that will exclude all of the extraneous values and give me a list of number. There are others ways to do this that are easier but not with the same control and flexibility and I wanted to give you a look at how complex the parse rule can get in this little program if needed. However, with a little work you can start creating parse rules like the one below; (hover mouse over image to enlarge except on Chrome)
This line is used to parse a Java Web Application Server log but this one takes advantage of C#’s unique ability to use the orange RegEx string to parse the data directly into variables within the code. When this RegEx is processed, the variables Timestamp, Date, StartTime, EndTime, Elapsed and Transact will be produced and will contain the data from the green sample log entry below.
What I’ve provided
This code has been designed for you to play with and hack up so I’m providing you a full Visual Studio 2017 project. You’re free to use it as you wish provided you maintain the credits with the code.
I’m also providing you’re a the MS SQL stored Procedures that will create the ETLManager database. You'll have to change the connection string in the App.Conf so the system will be able to find the ETLManager database and connect to it.
Remember that this is just a simple sample. Thus, the target server name has been hard coded in the DBOperations class. I recommend moving it into the ETLManager record with any username or password needed to log into the server and set up the feed to the Profile Config. After that you’ll be able to dynamically point the ETL output at any SQL server you want.
To access the code and SQL script visit my project Github: https://github.com/JohnJGoodwin/GoodParseETL