Project I – Big Data
CMPE 226 – Database Design

Submitted by: Manjula Kollipara, Roopa Penmetsa, Sumalatha Elliadka, Sridhar Srigiriraju
Project Advisor: Prof. John Gash
October 2012
Abstract

The goal of this project is to understand the ORM and JDBC methodologies and explore design considerations for managing big data.

Introduction

Our project is implemented to handle huge data sets and measure their performance. The design considerations are implemented using both ORM and JDBC frameworks. To make the performance results comparable, we kept the implementation design and resources as similar as possible.

The data set for this project consists of weather details downloaded from the internet. The data comes in three files, but we decided to ignore one of them because it is a subset of another. Hereon, the two main files are referenced as Station and Forecast. While the Station data is mostly constant, the Forecast data grows dynamically.

Implementation

In this project, we downloaded the weather data from the internet using a shell script, collecting data for 4 days. Approximately 2 GB of data was collected and loaded into the database using file-reading mechanisms. The following table gives a general overview of the implementation details.

Implementation Details
  Files / DB table names:  mesowest_csv.tbl -> Station, mesowest.out -> Forecast
  Number of tuples:        Station ~3.4 M, Forecast ~3.5 M

System Set-up
  Processor:               2.9 GHz Intel Core i7
  Memory:                  8 GB
Tools & Technology
  IDE:                    Eclipse
  Tools:                  ORM, JDBC
  Programming languages:  Java, XML, SQL
  Testing strategies:     Unit test cases, HQL, Criteria

Performance checks

- Loading data: The script downloads the data into separate folders with the corresponding time stamp. All the projects load data from these folders and store it into the database. This way we could compare the load time of all the designs.
- Insert/update/retrieve operations on data: These operations occur while loading the data and can be significantly affected by database design aspects such as relationships and constraints.
- Search or find operations on data: The test queries focus mainly on the speed of fetching data based on a few search criteria.
- SQL vs. HQL: By implementing the test-case queries in both SQL and HQL, we compared performance based on the query language specific to JDBC and Hibernate.

Design / Approaches considered

1. Denormalized data

- Goal: compare the load performance
- Performed basic clean-up: removed redundant and invalid data
- Station and Forecast data is loaded from the folders straight into the database
- No relationships in this design
- The Station and Forecast tables hold no constraints or integrity rules on any columns

Observations

JDBC:
- Provides a COPY command that lets the user dump data from a file into the database. If an integrity rule is violated, the COPY command breaks and throws an exception.
- In this project, the base version inserts tuple by tuple into the database. This method hits the database on every insert.

Hibernate:
- ORM allows traversing object by object, and the developer can control the number of database hits (batch updates are possible).
- Does not provide an option to load all ~3.5 M records at a time into the database. In this project, data is loaded in batches of 100 records at a time.

2. Normalized data

- Goal: analyze the effects of normalization on huge data
- Removed the redundant MNET, SLAT, SLON, SELV columns from the Forecast table
- Assumption: the Station primaryId is unique across all Stations.
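The batched loading used in the Hibernate version (flush every 100 records) can be sketched in plain Java. This is a minimal illustration of the control flow only: the hypothetical BatchSink interface stands in for Hibernate's Session (save/flush/clear), so the sketch stays self-contained rather than depending on the Hibernate API.

```java
import java.util.List;

// Sketch of the batch-of-100 loading strategy described above.
// BatchSink is a stand-in for org.hibernate.Session, for illustration only.
public class BatchLoader {
    static final int BATCH_SIZE = 100;

    interface BatchSink {          // hypothetical stand-in for a Hibernate Session
        void save(Object entity);  // queue an insert
        void flush();              // push queued inserts to the database
        void clear();              // evict entities so the session cache stays small
    }

    // Returns the number of flushes performed, for illustration.
    static int load(List<?> records, BatchSink session) {
        int flushes = 0;
        for (int i = 0; i < records.size(); i++) {
            session.save(records.get(i));
            if ((i + 1) % BATCH_SIZE == 0) {  // flush/clear every 100 rows
                session.flush();
                session.clear();
                flushes++;
            }
        }
        session.flush();  // push any trailing partial batch
        flushes++;
        return flushes;
    }
}
```

Flushing and clearing periodically is what lets the developer "control the number of hits to the database" instead of inserting tuple by tuple as in the JDBC base version.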
Observations

JDBC:
- Introduces a StationId to avoid duplicate Station information in the database.
- Normalized the Forecast table because it is dynamic. Implemented a foreign key in Forecast referring to Station. For each Forecast entry, the corresponding StationId has to be looked up from Station.
- Implemented a stored procedure for the 'SaveOrUpdate' strategy; it also ensures referential constraints are not violated.
- Implementation is comparatively simple.

Hibernate:
- Implemented a OneToMany relationship between the Station and Forecast tables. The data is loaded to both tables simultaneously by associating the Station object with the Forecast object at runtime. This approach involved a lot of file-searching time and degraded performance.
- All the fields are stored as strings; a better approach would be to use appropriate data types for the double and date fields.
- The developer had control over database accessibility.
- Implementation is complex, since Hibernate has many hidden/background mechanisms. For example, extra logic is needed to map a detached object to the current session.
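The JDBC observation above (introducing a surrogate StationId and looking it up for each Forecast entry) can be sketched as an in-memory registry. The class and method names here are illustrative, not taken from the project code; the map plays the role of the Station table's unique index.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the normalization step described above: Station rows are
// deduplicated by primaryId and assigned a surrogate StationId, which the
// Forecast rows then reference as a foreign key. Names are hypothetical.
public class StationRegistry {
    private final Map<String, Integer> idsByPrimaryId = new HashMap<>();
    private int nextId = 1;

    // Returns the existing StationId for this primaryId, or assigns a new one:
    // the in-memory analogue of the 'SaveOrUpdate' stored procedure.
    public int saveOrGetId(String primaryId) {
        return idsByPrimaryId.computeIfAbsent(primaryId, k -> nextId++);
    }

    public int stationCount() {
        return idsByPrimaryId.size();
    }
}
```

Repeated Station entries in the input files map to the same StationId, which is what removes the redundant MNET/SLAT/SLON/SELV columns from Forecast.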
3. Indexing

- Goals: analyze the search-query performance in the test cases, and the load performance when columns are indexed
- Indexed the columns used most frequently in the test cases. In this project we indexed the temperature field in the Forecast table.
- Indexes implicitly created on primary keys by Postgres are not counted toward the performance difference, since they exist in the non-indexed version too.
- This design is implemented only for the Hibernate version.

Observation

- Indexing added extra loading time, but the search queries were faster. However, we tested this design on a separate system, so the results are not directly comparable.

4. Partitioning

- Partitioned Forecast rather than Station. Since Station is small and mostly constant, there is little benefit in partitioning it; Forecast is dynamic and keeps growing, so it benefits more from partitioning.
- Implemented partitioning in both JDBC and Hibernate. Number of Forecast partitions = 2.
  - Assumed primaryId to be unique across all Stations.
  - Location algorithm: determines the Forecast partition number using a hash on the Station's primaryId.
- Due to lack of time, we could not complete the performance testing for this design.

Observations

JDBC:
- The number of partitions can be changed via a single variable in JDBCUtil.java.
- Uses a foreign-key relation from Forecast to Station.
- Implemented using a single Forecast class; the JDBC query is executed on the appropriately selected partition.
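The location algorithm above (a hash on the Station's primaryId picks the Forecast partition) can be sketched as follows. The table-name convention and class name are illustrative; only the two-partition setup and the single NUM_PARTITIONS variable mirror what the report describes.

```java
// Sketch of the partition location algorithm described above: a hash on the
// Station's primaryId selects one of the Forecast partitions. Names are
// illustrative; the partition count is a single variable, as in JDBCUtil.java.
public class PartitionLocator {
    static final int NUM_PARTITIONS = 2;

    // The mask keeps the hash non-negative (avoids the
    // Math.abs(Integer.MIN_VALUE) edge case).
    static int partitionFor(String primaryId) {
        return (primaryId.hashCode() & 0x7fffffff) % NUM_PARTITIONS;
    }

    // e.g. "Forecast1" or "Forecast2", used when building the SQL statement.
    static String tableFor(String primaryId) {
        return "Forecast" + (partitionFor(primaryId) + 1);
    }
}
```

Because the hash is deterministic, every Forecast row for a given Station always lands in the same partition, so per-Station queries touch only one partition.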
Hibernate:
- The number of partitions can be changed via a single variable in HibernateUtil.java.
- Uses a @ManyToOne relation from Forecast to Station.
- Implemented using an abstract base Forecast class and two child classes, Forecast1 and Forecast2, that represent the two partitions.
- Initially used the @Inheritance(strategy=InheritanceType.TABLE_PER_CLASS) strategy and then moved to the @MappedSuperclass strategy.

Partitioning served as a good exercise before moving on to sharding.

5. Sharding

- Goal: analyze the performance of the normalized version when sharded into 2 databases
- Implemented sharding in both JDBC and Hibernate. Number of shards = 2.
  - Assumed primaryId to be unique across all Stations.
  - Location algorithm: determines the shard number using a hash on the Station's primaryId.
- Multiple Postgres instances run on the same host.

Observations
JDBC:
- The developer has to think from a database designer's perspective to design and implement the sharding methodology.
- The number of shards can be changed via a single variable in JDBCUtil.java.
- Uses a foreign-key relation from Forecast to Station.
- Implemented using a single Station class and a single Forecast class; the JDBC query is executed using the connection to the appropriately selected instance.

Hibernate:
- We found the implementation process very developer-friendly.
- Implemented the ManyToOne strategy for creating relationships. The Station table is loaded first, and then the Forecast data is entered into the database. The relationship is mapped by retrieving the Station object from the database and attaching it to the Forecast details while inserting the Forecast data. This avoided the file searching and mapping time and provided a performance gain.
- Proper data types are used for the columns in the database.
- The number of shards can be changed via a single variable in HibernateUtil.java.
- Uses a @ManyToOne relation from Forecast to Station.
- Implemented using a single Station class and a single Forecast class; the object is saved using the session to the appropriately selected instance.
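Shard routing works the same way as partition location, except the hash selects a database instance rather than a table. A minimal sketch, assuming the two localhost Postgres instances mentioned in the run instructions (ports 5432 and 5433); the database name "weather" and class name are illustrative.

```java
// Sketch of the shard location algorithm described above: a hash on the
// Station's primaryId picks which Postgres instance (shard) to use. The
// caller opens its JDBC connection -- or picks its Hibernate
// SessionFactory -- using the selected URL. The database name is hypothetical.
public class ShardRouter {
    static final String[] SHARD_URLS = {
        "jdbc:postgresql://localhost:5432/weather",  // shard 0
        "jdbc:postgresql://localhost:5433/weather",  // shard 1
    };

    // Mask keeps the hash non-negative; adding a URL to the array adds a shard.
    static int shardFor(String stationPrimaryId) {
        return (stationPrimaryId.hashCode() & 0x7fffffff) % SHARD_URLS.length;
    }

    static String urlFor(String stationPrimaryId) {
        return SHARD_URLS[shardFor(stationPrimaryId)];
    }
}
```

Since a Station and all of its Forecast rows hash to the same shard, the foreign-key relationship never has to span two instances.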
Test cases

Three types of search queries were written to test the query performance of the various approaches used in this project:

- Query 1 - for a given StationId: fetches the related records from the Station and Forecast tables.
- Query 2 - for a given temperature range: fetches all the related Station and Forecast details.
- Query 3 - for a given Forecast time: fetches all the related Station and Forecast details.

Performance comparison table

                        JDBC      Hibernate
Denormalized*
  Loading (hrs)         10        1.3
  Query 1 (msecs)       194       986
  Query 2 (msecs)       10500     4099
  Query 3 (msecs)       780       1033
Normalized*
  Loading (hrs)         3         6
  Query 1 (msecs)       113       2638
  Query 2 (msecs)       7747      16863
  Query 3 (msecs)       768       1231
Indexing**
  Loading (hrs)         -         11
  Query 1 (msecs)       -         119000
  Query 2 (msecs)       -         60000
  Query 3 (msecs)       -         2000
Sharding*
  Loading (hrs)         2.2       2.45
  Query 1 (msecs)       127       1471
  Query 2 (msecs)       6363      490267
  Query 3 (msecs)       885       2539

* Performed on a MacBook with 8 GB RAM
** Performed on a PC with 4 GB RAM
Performance comparison graphs

[Graph: Design vs. Load Performance (in hrs) - JDBC vs. Hibernate across the Denormalized, Normalized, and Sharding designs]

Denormalized:
1. In JDBC the records are loaded into the database tuple by tuple. This hits the database each time and causes a large performance loss.
2. In Hibernate the data is loaded in batches. Since there were no constraints, the data loaded faster and took less write time.

Normalized:
1. In JDBC, since the data is normalized into the Forecast table, the load performance improved drastically.
2. In Hibernate, since we tried to load both the Station and Forecast data simultaneously, every Station required reading all the records in the Forecast file. This caused high file-reading time and a performance loss. Hibernate is also slower than JDBC because of the additional ORM layer.

Sharding:
1. The load time is faster since the data is distributed, so each database operation works on a comparatively smaller set of records.
[Graph: Design vs. Query 1 Performance (in msecs) - JDBC vs. Hibernate across the Denormalized, Normalized, and Sharding designs]

Query 1: finding all the details (including Forecasts) related to a Station, passing the StationId as a parameter.

Denormalized:
1. The JDBC join is faster than the ORM HQL join, since JDBC accesses the database directly.

Normalized:
1. The JDBC join is faster than ORM lazy fetching, since JDBC accesses the database directly.
2. For the normalized version, the query performance can be improved by eager fetching.

Sharding:
1. In both JDBC and Hibernate the performance improves, since fewer records are involved in searching the data.
[Graph: Design vs. Query 2 Performance (in msecs) - JDBC vs. Hibernate across the Denormalized, Normalized, and Sharding designs]

Query 2: fetching all the records falling between the temperatures 80 and 90.

1. This query has a high response time due to the large number of records (~2.3 million) retrieved from the database.
2. The JDBC query performs faster in the normalized version due to the foreign-key constraint, and performance improves further with sharding because less searching is involved.
3. Hibernate HQL queries are faster in the denormalized version since there is less mapping involved, irrespective of the query type.
4. The response time increased for Hibernate sharding since data is fetched from both shards and an inner query is needed to get the mapped ids. The ORM mapping time also increases because of the extra sessions and transactions involved.
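The SQL/HQL contrast that Query 2 measures can be illustrated with the two query strings side by side. This is a sketch only: the table, column, and entity names are assumptions, not the project's actual schema, and neither statement is executed here.

```java
// Sketch contrasting the SQL and HQL forms of the temperature-range query
// (Query 2). Table/column/entity names are illustrative.
public class Query2Examples {
    // JDBC: raw SQL with an explicit join on the foreign key; parameters are
    // bound through a PreparedStatement's ? placeholders.
    static final String SQL =
        "SELECT s.*, f.* FROM Forecast f "
        + "JOIN Station s ON f.stationId = s.stationId "
        + "WHERE f.temperature BETWEEN ? AND ?";

    // Hibernate: HQL queries the mapped Forecast entity; the Station details
    // come through the mapped association rather than a hand-written join.
    static final String HQL =
        "from Forecast f where f.temperature between :low and :high";
}
```

The HQL form is shorter because the ORM supplies the join from the mapping metadata; the trade-off, as the graph shows, is the extra mapping work at fetch time.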
[Graph: Design vs. Query 3 Performance (in msecs) - JDBC vs. Hibernate across the Denormalized, Normalized, and Sharding designs]

Query 3: fetching all the records for a particular timestamp.

The reasons for the performance variations in this graph are the same as for Query 2. The decrease in time is due to the smaller number of records returned compared to Query 2.
Lessons Learnt

- Manually reading from each file takes a lot of time.
- Hibernate requires fewer lines of code than JDBC, but analyzing and using Hibernate is much more complex. Hibernate demands a steep learning curve due to its complicated background processes.
- When the database changes, it is easier to adapt a Hibernate implementation than a JDBC one.
- Future work for this project includes comparing performance after introducing a second-level cache for Hibernate and caching mechanisms for JDBC.
- Avoid triggers and indexes while loading the data.
- The partitions or shards should be distributed as uniformly as possible.
- Hibernate provides a lot of flexibility in writing queries compared to JDBC. Criteria are useful when the query needs to be generated dynamically, and the various fetching strategies (eager/lazy) provide flexibility in framing the selection process.
- JDBC is the old paradigm but easier to debug; Hibernate is the new paradigm and tougher to debug.
- Thinking and working with objects (via serialization) is better than thinking and working with SQL query strings.
- Using Criteria to perform queries is powerful enough to retrieve related objects; the same results require explicit joins or multiple queries in JDBC.
- With Hibernate, the table structures are controlled directly from the object definitions, which gave better control and flexibility to recreate the tables. With JDBC this requires matching the SQL structure against the Java objects/code.
- Sessions are an important concept in Hibernate that improve performance through batching and caching. In JDBC, caching must be explicitly coded by the developer.
- There is no 'SaveOrUpdate' concept in JDBC/SQL queries; it has to be implemented with a stored procedure that performs an 'update' else 'insert'. Hibernate provides this out of the box.
- JDBC can have update conflicts when two threads simultaneously update a record, unless the developer embeds extra version information in each query.
- Using Hibernate's @Version annotation, this behavior is easy to achieve.
- Using composite keys is a bit more work in Hibernate, but easy and flexible.
- JDBC requires explicit ResultSet-to-object conversion; with Hibernate this happens implicitly.
- Using JPA relationships provides a lot of flexibility in embedding related objects and expressing the relations between them.
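The optimistic-locking idea behind @Version, mentioned above, can be sketched in plain Java without Hibernate classes (so the example stays self-contained; the class and method names are hypothetical). Hibernate enforces the same rule by appending a "where version = ?" condition to its UPDATE statements and bumping the version on success.

```java
// Plain-Java sketch of the versioned-update check that Hibernate's @Version
// performs: an update applies only if the writer read the current version.
public class VersionedRecord {
    private int version = 0;
    private String value = "";

    int version() { return version; }
    String value() { return value; }

    // Returns true if the update applied; false if another writer got in
    // first (the caller's read was stale), mirroring an optimistic-lock failure.
    synchronized boolean update(int expectedVersion, String newValue) {
        if (expectedVersion != version) {
            return false;  // stale read: reject instead of overwriting
        }
        value = newValue;
        version++;         // bump, as Hibernate does with the @Version column
        return true;
    }
}
```

In plain JDBC, as the lesson above notes, the developer has to embed this version comparison in every UPDATE statement by hand.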
Instructions to run the program

1. mesowest.sh is used to download files from the internet. Open it in an editor and change the variable 'arch' to the path where you want to store the downloaded data.
2. For all the project folders, make the following changes to the DirectoryRead.java file:
   - Change the variable homePath to the path where all the downloaded folders are kept. Optionally, this path can be provided as the 1st command-line argument.
   - Change the variable archivePath to the path where you want to move the data after it is read. Optionally, this path can be provided as the 2nd command-line argument.
3. For all Hibernate projects, edit the configuration details in hibernate.cfg.xml. Optionally, an alternate configuration file can be provided as the 3rd command-line argument.
4. For all JDBC-related projects, change the database connection details in JDBCUtil.java. By default, the 1st instance is at localhost:5432 and the 2nd instance is at localhost:5433.
5. For all JDBC-related projects, run the .sql files to create the database schema.
6. In AutomatedBaseVersion, we automated downloading the data directly from the website and loading it into the database.

Contribution

- Manjula Kollipara
  - JDBC normalization, JDBC partitioning, JDBC horizontal sharding
  - Hibernate partitioning, Hibernate horizontal sharding
  - Jar & Ant builds, database design
- Roopa Penmetsa
  - Hibernate indexing version
  - Test cases for all Hibernate versions
  - Ant builds
- Sumalatha Elliadka
  - Hibernate base version, Hibernate normalization
  - JDBC JUnit library functions
  - Test scripts
- Sridhar Srigiriraju
  - Created shell scripts to automate the data download process
  - Database design

References

1. http://www.mkyong.com/hibernate/
2. http://docs.jboss.org/hibernate/orm/4.0/devguide/en-US/html_single/
3. https://forum.hibernate.org/viewtopic.php?f=1&t=966223