2. Justis Publishing is an independent publisher of electronic legal information. It owns one of the
largest online collections of UK and international case law, as well as other legal documents
including UK and international legislation, acts, and parliamentary resources and publications.
On a daily basis the company receives hundreds of documents from its data providers: legal
information publishers such as England and Wales Civil Appeal Judgments, Canada Law Reports,
Bermuda Law Reports and Jersey Law Reports, as well as various legal publications from Cambridge
University Press, Oxford University Press, LexisNexis and others.
Justis Publishing faced a problem in processing incoming data from its data providers. The projects
posed several challenges. The biggest one was that each data provider uses a different type of source
data file, exported, extracted or copied from disparate systems in many different formats, preventing
easy and quick integration and information sharing. Many of the incoming file types follow different
designs, coding standards and business rules, arrive structured or unstructured, and leave a large
margin for error. The vast majority of the data was not properly structured. Documents from a single
data source often contain many types of data irregularities and inconsistencies, such as:
unclosed tags in XML files, which make them inaccessible to further automatic processing
missing carriage return or new line characters in text files (typical of text files exported from Unix systems)
leading blank lines
unrecognized symbols
non-compliant encoding that is difficult to parse
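To illustrate the kind of pre-processing these irregularities call for, the following is a minimal cleansing sketch in C# (one of the languages used on the project). It is illustrative only, not the production code: it normalises line endings, strips leading blank lines and drops characters that are not valid in XML; the command-line file handling in Main is hypothetical.

    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    // Minimal cleansing sketch: illustrative only, not the production ETL code.
    class IncomingFileCleanser
    {
        static void Main(string[] args)
        {
            // Hypothetical usage: clean a single incoming file passed on the command line.
            string raw = File.ReadAllText(args[0]);
            File.WriteAllText(args[0] + ".clean", Cleanse(raw));
        }

        static string Cleanse(string rawText)
        {
            // Normalise Unix-style (LF) and old Mac-style (CR) line endings to CRLF
            // so that downstream text parsers see consistent line breaks.
            string text = Regex.Replace(rawText, "\r\n|\n|\r", "\r\n");

            // Strip leading blank lines.
            text = text.TrimStart('\r', '\n');

            // Drop characters that are not valid in XML 1.0 documents, a common cause
            // of "unrecognized symbol" parse failures (this simplified check also
            // strips surrogate pairs for characters outside the Basic Multilingual Plane).
            var sb = new StringBuilder(text.Length);
            foreach (char c in text)
            {
                bool validXml = c == 0x9 || c == 0xA || c == 0xD ||
                                (c >= 0x20 && c <= 0xD7FF) ||
                                (c >= 0xE000 && c <= 0xFFFD);
                if (validXml) sb.Append(c);
            }
            return sb.ToString();
        }
    }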
Justis Publishing had struggled with this problem since the beginning of the business and had never
solved it completely. As Justis Publishing is heavily dependent on its data providers, the problem had
an extremely large impact on the business.
3. As an SSIS developer, I was responsible for ETL processes and tools as well as database loading
and manipulation, ensuring that the design complied with requirements, established
methodologies and best practices. In doing so I used various data audit and validation
methods and procedures to ensure the quality and effectiveness of legacy data conversions to
the data warehouse. I then designed the solution by conceptualizing data flows and
transformations, translating them into a sequence of steps, performing source-to-target data
mapping, developing code, debugging and testing. I utilized various ETL tools
and scripting languages, including Ruby, C#, VBA, Java and JavaScript, and developed shell scripts
and stored procedures to support the new solution.
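As an illustration of the data audit and source-to-target mapping work described above, the sketch below shows what a single validation and mapping rule might look like in C#. The field names (case_ref, judgment_date, court) and the rules themselves are hypothetical, chosen only to demonstrate the technique, not taken from the actual warehouse schema.

    using System;
    using System.Collections.Generic;

    // Illustrative source-to-target mapping and validation step (a sketch only).
    class DocumentMapper
    {
        // Target row shape for a hypothetical warehouse staging table.
        public record TargetRow(string CaseRef, DateTime JudgmentDate, string Court);

        public static bool TryMap(IDictionary<string, string> source,
                                  out TargetRow row, out string error)
        {
            row = null;
            error = null;

            // Audit rule: the case reference must be present and non-empty.
            if (!source.TryGetValue("case_ref", out var caseRef) ||
                string.IsNullOrWhiteSpace(caseRef))
            {
                error = "Missing case reference";
                return false;
            }

            // Audit rule: the judgment date must parse to a valid date.
            if (!source.TryGetValue("judgment_date", out var rawDate) ||
                !DateTime.TryParse(rawDate, out var judgmentDate))
            {
                error = "Invalid or missing judgment date";
                return false;
            }

            // Map the remaining source fields to target columns, defaulting where allowed.
            source.TryGetValue("court", out var court);
            row = new TargetRow(caseRef.Trim(), judgmentDate, court ?? "Unknown");
            return true;
        }

        static void Main()
        {
            // Hypothetical incoming record used to exercise the rules above.
            var source = new Dictionary<string, string>
            {
                ["case_ref"] = "[2013] EWCA Civ 1",
                ["judgment_date"] = "2013-01-15"
            };
            Console.WriteLine(TryMap(source, out var row, out var err) ? row.ToString() : err);
        }
    }

In an SSIS data flow, rows failing a rule of this kind would typically be redirected to an error output for review rather than loaded into the warehouse.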
The solution I developed extracts data from the files, performs various cleansing, validation,
transformation and conversion manipulations, stores the data in a relational database and then
establishes interlinks between the existing documents using advanced data processing based
on a fuzzy logic algorithm.
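The interlinking step is essentially approximate matching of case references across documents. The sketch below shows one common way such matching can be implemented, a normalised Levenshtein (edit distance) similarity with an illustrative threshold; it is a simplified illustration of the idea, not the actual algorithm or thresholds used in the solution.

    using System;

    // Simplified fuzzy-matching sketch for linking case citations (illustrative only).
    static class CitationLinker
    {
        // Classic Levenshtein edit distance between two strings.
        static int EditDistance(string a, string b)
        {
            var d = new int[a.Length + 1, b.Length + 1];
            for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
            for (int j = 0; j <= b.Length; j++) d[0, j] = j;
            for (int i = 1; i <= a.Length; i++)
                for (int j = 1; j <= b.Length; j++)
                {
                    int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                    d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                       d[i - 1, j - 1] + cost);
                }
            return d[a.Length, b.Length];
        }

        // Two citations are treated as the same case when their normalised
        // similarity exceeds a threshold (0.85 here is purely illustrative).
        public static bool IsLikelyMatch(string left, string right, double threshold = 0.85)
        {
            string a = left.Trim().ToUpperInvariant();
            string b = right.Trim().ToUpperInvariant();
            int maxLen = Math.Max(a.Length, b.Length);
            if (maxLen == 0) return true;
            double similarity = 1.0 - (double)EditDistance(a, b) / maxLen;
            return similarity >= threshold;
        }

        static void Main()
        {
            // Two slightly different renderings of the same hypothetical citation.
            Console.WriteLine(IsLikelyMatch("[2005] UKHL 12", "(2005) UKHL 12")); // True
        }
    }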
Once the solution was deployed, I monitored the performance of the ETL processes and corrected any
identified issues by performing root cause analysis on problematic queries and ETL jobs.
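As an example of the kind of check used during that monitoring, the snippet below pulls the longest-running package executions from the SSIS catalog (SSISDB). It assumes the packages are executed from the catalog; the connection string is hypothetical, and the query is only a starting point for root cause analysis rather than the full diagnostic process.

    using System;
    using System.Data.SqlClient;

    // Illustrative monitoring query against the SSIS catalog (SSISDB).
    class EtlMonitor
    {
        static void Main()
        {
            // Hypothetical connection string; adjust server and credentials as needed.
            const string connectionString = "Server=.;Database=SSISDB;Integrated Security=true";

            // Ten slowest completed package executions, by elapsed time.
            const string sql = @"
                SELECT TOP (10)
                       package_name,
                       DATEDIFF(SECOND, start_time, end_time) AS duration_seconds
                FROM   catalog.executions
                WHERE  end_time IS NOT NULL
                ORDER  BY duration_seconds DESC;";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt32(1)} s");
                    }
                }
            }
        }
    }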
The projects have revolutionised the way Justis Publishing works with its data providers:
Data from multiple sources moves far more efficiently and in a scalable way
All data is processed minutes after it arrives, rather than the months it used to take
Interlinks between new and existing legal cases have improved significantly.