A Scalable Data Transformation Framework using the Hadoop Ecosystem

A Scalable Data Transformation Framework @ Penton using the Hadoop Ecosystem
Raj Nair, Director – Data Platform
Kiru Pakkirisamy, CTO
AGENDA
• About Penton and Serendio Inc
• Data Processing at Penton
• PoC Use Case
• Functional Aspects of the Use Case
• Big Data Architecture, Design and Implementation
• Lessons Learned
• Conclusion
• Questions
About Penton
• Professional information services company
• Provide actionable information to five core markets: Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing
• Success Stories
– EquipmentWatch.com: Prices, Specs, Costs, Rental
– Govalytics.com: Analytics around Gov’t capital spending down to county level
– SourceESB: Vertical Directory, electronic parts
– NextTrend.com: Identify new product trends in the natural products industry
About Serendio
Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises.
www.serendio.com
Data Processing at Penton
What got us thinking?
• Business units process data in silos
• Heavy ETL
– Hours to process, in some cases days
• Not even using all the data we want
• Not logging what we needed to
• Can’t scale for future requirements
The Data Processing Pipeline
[Diagram: assembly-line processing through the data processing pipeline delivers business value – new features, new insights, new products]
Penton examples
• Daily Inventory data, ingested throughout the day (tens of thousands of parts)
• Auction and survey data gathered daily
• Aviation Fleet data, varying frequency
• Various data formats, mostly unstructured
• Pipeline: Ingest, store → Clean, validate → Apply Business Rules → Map → Analyze → Report → Distribute
• Slow Extract, Transform and Load = frustration + missed business SLAs; won’t scale for the future
What were our options?
• Adopt the Hadoop Ecosystem (HBase, Drools)
– M/R: ideal for batch processing
– Flexible for storage
– NoSQL: scale, usability and flexibility
• Expand RDBMS options (Oracle, SQL Server)
– Expensive
– Complex
PoC Use Case
Primary Use Case
• Daily model data – upload and map
– Ingest data, build buckets
– Map data (batch and interactive)
– Build aggregates (dynamic)
• Issue: mapping time
Functional Aspects
Data Scrubbing
• Standardized names for fields/columns
• Example – Country
– United States of America -> USA
– United States -> USA
Data Mapping
• Converting fields -> IDs
– Manufacturer: Caterpillar -> 25
– Model: Caterpillar/Front Loader -> 300
• Requires the use of lookup tables and partial/fuzzy string matching
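The deck does not show the mapping code itself; the sketch below only illustrates the idea of lookup-table mapping with a partial-match fallback in Java. The manufacturer values, IDs and matching rule are illustrative assumptions, not the PoC's actual logic.

```java
// Hypothetical sketch of field-to-ID mapping with a lookup table plus a partial-match
// fallback; the lookup contents and the matching rule are illustrative assumptions.
import java.util.HashMap;
import java.util.Map;

public class ManufacturerMapper {
  private final Map<String, Integer> lookup = new HashMap<String, Integer>();

  public ManufacturerMapper() {
    // In the PoC the lookup tables are Sqoop-imported from the RDBMS; hard-coded here.
    lookup.put("CATERPILLAR", 25);
    lookup.put("KOMATSU", 31);
  }

  /** Map a raw manufacturer string to its ID: exact match first, then a partial match. */
  public Integer mapToId(String raw) {
    String key = raw.trim().toUpperCase();
    Integer id = lookup.get(key);
    if (id != null) {
      return id;                              // exact match, e.g. "Caterpillar" -> 25
    }
    for (Map.Entry<String, Integer> e : lookup.entrySet()) {
      if (key.contains(e.getKey()) || e.getKey().contains(key)) {
        return e.getValue();                  // partial match, e.g. "Caterpillar Inc." -> 25
      }
    }
    return null;                              // unmapped; the row stays in HBase for review
  }
}
```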
Data Exporting
• Move scrubbed/mapped data to the main RDBMS
Current Design
• Survey data loaded as CSV files
• Data needs to be scrubbed/mapped
• All CSV rows loaded into one table
• Once scrubbed/mapped, data is loaded into the main tables
• Not all rows are loaded; some may be used in the future
Key Pain Points
• CSV data table continues to grow
• Large size of the table impacts operations on rows in a single file
• CSV data could grow rapidly in the future
Criteria for New Design
• Ability to store an individual file and manipulate it easily
– No joins/relationships across CSV files
• Solution should have good integration with the RDBMS
• Could possibly host the complete application in the future
• Technology stack should possibly have advanced analytics capabilities
• A NoSQL model would allow us to quickly retrieve/address an individual file and manipulate it
Big Data Architecture
Solution Architecture
[Diagram: the Data Upload UI and existing business applications make API calls to a REST API (CSV and Rule Management endpoints, backed by Drools) over HBase on Hadoop HDFS; MR jobs are launched against the stored CSV files; accepted data is inserted into the master database of Products/Parts in the current Oracle schema, and updates are pushed back]
• Use HBase as a store for CSV files
• Data manipulation APIs exposed through a REST layer
• Drools for rule-based data scrubbing
• Operations on individual files in the UI through HBase Get/Put
• Operations on all/groups of files using MR jobs
HBase Schema Design
• One row per HBase row
• One file per HBase row
– One cell per column qualifier (simple; started the development with this approach)
– One row per column qualifier (more performant approach)
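A minimal sketch of the "one file per HBase row" layout with one CSV line per column qualifier, using the 0.94-era client API this PoC would have run on. The table name, column family ("d") and zero-padded qualifiers are assumptions for illustration.

```java
// Hypothetical sketch: store one uploaded CSV file as a single HBase row,
// one cell per CSV line (0.94-era Put API). Names are illustrative assumptions.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CsvFileWriter {

  public static void storeFile(byte[] rowKey, String[] csvLines) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "csv_files");
    try {
      Put put = new Put(rowKey);
      for (int i = 0; i < csvLines.length; i++) {
        // Column qualifier = zero-padded line number, value = the raw CSV line.
        byte[] qualifier = Bytes.toBytes(String.format("%08d", i));
        put.add(Bytes.toBytes("d"), qualifier, Bytes.toBytes(csvLines[i])); // 0.94 Put API
      }
      table.put(put);
    } finally {
      table.close();
    }
  }
}
```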
HBase Rowkey Design
• Row Key – composite
– Created Date (YYYYMMDD)
– User
– FileType
– GUID
• Salting for better region splitting
– One byte
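A hedged sketch of building the composite, salted row key described above. The deck only specifies the key components and a one-byte salt; the delimiter, hash-based salt and bucket count below are assumptions.

```java
// Hypothetical row key builder: one salt byte + createdDate + user + fileType + GUID.
import java.util.UUID;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {
  private static final int SALT_BUCKETS = 16;   // assumption: number of pre-split regions

  public static byte[] build(String createdDate /* YYYYMMDD */, String user,
                             String fileType, UUID guid) {
    String logicalKey = createdDate + "|" + user + "|" + fileType + "|" + guid;
    // A leading salt byte derived from the logical key spreads writes across regions;
    // without it, the date prefix would hot-spot a single region each day.
    byte salt = (byte) ((logicalKey.hashCode() & 0x7fffffff) % SALT_BUCKETS);
    return Bytes.add(new byte[] { salt }, Bytes.toBytes(logicalKey));
  }
}
```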
HBase Column Family Design
• Column Family
– Data separated from metadata into two or more column families
– One cf for mapping data (more later)
– One cf for analytics data (used by analytics coprocessors)
M/R Jobs
• Jobs
– Scrubbing
– Mapping
– Export
• Schedule
– Manually from the UI
– On a schedule using Oozie
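The deck does not include job code; below is a sketch of what the scrubbing job could look like as an HBase TableMapper (0.94-era mapreduce API), scanning the CSV column family and writing standardized values back as Puts. The table/family names and the scrub rule shown are placeholders, not the PoC's Drools rules.

```java
// Hypothetical map-only scrubbing job over the CSV table (0.94-era HBase MR API).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ScrubJob {

  static class ScrubMapper extends TableMapper<ImmutableBytesWritable, Put> {
    private static final byte[] CF_DATA = Bytes.toBytes("d");

    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      Put put = new Put(rowKey.get());
      // Re-write each CSV line of the file with standardized field values.
      for (java.util.Map.Entry<byte[], byte[]> cell : row.getFamilyMap(CF_DATA).entrySet()) {
        String scrubbed = scrub(Bytes.toString(cell.getValue()));
        put.add(CF_DATA, cell.getKey(), Bytes.toBytes(scrubbed));   // 0.94 Put API
      }
      ctx.write(rowKey, put);
    }

    private String scrub(String csvLine) {
      // Placeholder for the rule-driven standardization (Drools in the actual PoC).
      return csvLine.replace("United States of America", "USA");
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "csv-scrubbing");
    job.setJarByClass(ScrubJob.class);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("d"));          // only the CSV data column family
    scan.setCaching(500);
    scan.setCacheBlocks(false);                  // recommended for MR scans
    TableMapReduceUtil.initTableMapperJob("csv_files", scan,
        ScrubMapper.class, ImmutableBytesWritable.class, Put.class, job);
    TableMapReduceUtil.initTableReducerJob("csv_files", null, job); // map-only, Puts written back
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```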
Sqoop Jobs
• One time
– FileDetailExport (current CSV)
– RuleImport (all current rules)
• Periodic
– Lookup table data import: Manufacturer, Model, State, Country, Currency, Condition, Participant
Application Integration – REST
• Hide HBase API/Java APIs from the rest of the application
• Language independence for the PHP front-end
• REST APIs for
– CSV Management
– Drools Rule Management
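The slides do not name the REST framework; as one possible shape, here is a JAX-RS sketch of a CSV management endpoint that hides the HBase Java API behind JSON, so the PHP front-end only speaks HTTP. The path, table and family names are assumptions.

```java
// Hypothetical JAX-RS sketch of a CSV endpoint wrapping the HBase Java client (0.94-era).
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

@Path("/csv")
@Produces(MediaType.APPLICATION_JSON)
public class CsvResource {

  /** Return one CSV file (all of its lines) by its HBase row key. */
  @GET
  @Path("/{rowKey}")
  public java.util.Map<String, String> getFile(@PathParam("rowKey") String rowKey) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "csv_files");
    try {
      Result result = table.get(new Get(Bytes.toBytes(rowKey)));
      java.util.Map<String, String> lines = new java.util.TreeMap<String, String>();
      java.util.NavigableMap<byte[], byte[]> family = result.getFamilyMap(Bytes.toBytes("d"));
      if (family != null) {
        for (java.util.Map.Entry<byte[], byte[]> cell : family.entrySet()) {
          lines.put(Bytes.toString(cell.getKey()), Bytes.toString(cell.getValue()));
        }
      }
      return lines;  // the PHP front-end never touches HBase directly
    } finally {
      table.close();
    }
  }
}
```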
Lessons Learned
Performance Benefits
• Mapping
– 20,000 CSV files, 20 million records
– Time taken: one third of RDBMS processing time
• Metrics
– < 10 secs (vs. Oracle Materialized View)
• Upload a file
– < 10 secs
• Delete a file
– < 10 secs
HBase Tuning
• Heap size for
– RegionServer
– MapReduce tasks
• Table compression
– SNAPPY for the column family holding CSV data
• Table data caching
– IN_MEMORY for lookup tables
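As an illustration of the last two settings, a minimal sketch (0.94-era admin API) of creating the tables with SNAPPY compression on the CSV column family and IN_MEMORY caching for a lookup table; table and family names are assumptions.

```java
// Hypothetical table creation showing the compression and caching settings above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateTables {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // CSV file table: SNAPPY on the column family holding raw CSV data,
    // a separate family for metadata (per the column family design slide).
    HTableDescriptor csv = new HTableDescriptor("csv_files");
    HColumnDescriptor data = new HColumnDescriptor("d");     // raw CSV rows
    data.setCompressionType(Compression.Algorithm.SNAPPY);
    HColumnDescriptor meta = new HColumnDescriptor("m");     // file metadata
    csv.addFamily(data);
    csv.addFamily(meta);
    admin.createTable(csv);

    // Lookup tables are small and read-heavy: keep them cached in memory.
    HTableDescriptor lookup = new HTableDescriptor("lookup_manufacturer");
    HColumnDescriptor lk = new HColumnDescriptor("d");
    lk.setInMemory(true);
    lookup.addFamily(lk);
    admin.createTable(lookup);

    admin.close();
  }
}
```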
Application Design Challenges
• Pagination – implemented using the intermediate REST layer and scan.setStartRow
• Translating SQL queries
– Used Scan/Filter and Java (especially on coprocessors)
– No secondary indexes – used FuzzyRowFilter
– Maybe something like Phoenix would have helped
• Some issues in mixed mode; want to move to 0.96.0 for better/individual column family flushing, but needed to 'port' coprocessors (to protobuf)
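A sketch of the scan.setStartRow pagination approach mentioned above, as it might look behind the REST layer; the table name, page-size handling and cursor convention are assumptions.

```java
// Hypothetical pagination helper: page through CSV file rows using Scan.setStartRow.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

public class FilePager {

  /** Return up to pageSize rows starting at startRow; the last key returned becomes the
      next page's cursor (callers typically append a trailing zero byte to the cursor so
      the last row of the previous page is not repeated). */
  public static List<Result> page(byte[] startRow, int pageSize) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "csv_files");
    try {
      Scan scan = new Scan();
      if (startRow != null) {
        scan.setStartRow(startRow);              // resume where the previous page ended
      }
      scan.setFilter(new PageFilter(pageSize));  // server-side cap, applied per region
      scan.setCaching(pageSize);
      List<Result> rows = new ArrayList<Result>();
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          rows.add(r);
          if (rows.size() >= pageSize) break;    // PageFilter is per-region, so re-check here
        }
      } finally {
        scanner.close();
      }
      return rows;
    } finally {
      table.close();
    }
  }
}
```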
HBase Value Proposition
• Better response in the UI for CSV file operations – operations within a file (map, import, reject, etc.) not dependent on the db size
• Relieve load on the RDBMS – no more CSV data tables
• Scale out batch processing performance on the cheap (vs. a vertical RDBMS upgrade)
• Redundant store for CSV files
• Versioning to track data cleansing
Roadmap
• Benchmark with 0.96
• Retire coprocessors in favor of Phoenix (?)
• Lookup data tables are small – need to find a better alternative than HBase
• Design the UI for a model more appropriate to Big Data
– Search-oriented paradigm rather than exploratory/paginative
– Add REST endpoints to support such a UI
Wrap-Up
Conclusion
• PoC demonstrated
– Value of the Hadoop ecosystem
– Co-existence of Big Data technologies with current solutions
– Adoption can significantly improve scale
– New skill requirements
Thank You
Rajesh.Nair@Penton.com
Kiru@Serendio.com
