Prashanth Shankar Kumar has over 8 years of experience in data analytics, Hadoop, Teradata, and mainframes. He currently works as a Hadoop Developer/Tech Lead at Bank of America, where he develops Hive queries, Impala queries, MapReduce programs, and Oozie workflows. Previously he worked as a Hadoop Developer at State Farm Insurance, where he installed and managed Hadoop clusters and developed solutions using Hive, Pig, Sqoop, and HBase. He has expertise in Teradata, SQL, Java, Linux, and agile methodologies.
• Capable of processing large sets of structured, semi-structured, and unstructured data and supporting system architecture.
• Implemented proofs of concept on the Hadoop stack and different big data analytic tools, including migration from different databases to Hadoop.
• Developed multiple MapReduce jobs in Java for data cleaning and pre-processing according to the business requirements; imported and exported data into HDFS and Hive using Sqoop.
• Experienced in writing Hive queries and Pig scripts.
Prashanth Shankar Kumar
Certified Teradata Developer
Certified Hadoop Developer
Profile
• 8 years of experience in the IT industry, with strong experience in application development, data analytics, the Hadoop platform, Teradata, and IBM Mainframes in the insurance and financial sectors.
• Around 2.5 years of expertise in core Hadoop and the Hadoop technology stack, including HDFS, Sqoop, Hive, HBase, Impala, Spark (ongoing), and MapReduce programming.
• Familiar with data architecture, including data ingestion, pipeline design, Hadoop information architecture, data modelling, data mining, machine learning, advanced data processing, and optimizing ETL workflows.
• Experience with Continuous Integration tools such as Jenkins.
• Experience in developing and deploying web services (SOAP).
• Knowledge of RESTful interfaces.
• Able to assess business rules, collaborate with stakeholders, and perform source-to-target data mapping, design, and review.
• Conducted induction and orientation sessions for newly joined peers and acted as a mentor for Hadoop topology and cluster configuration in State Farm life and auto insurance.
• Full exposure to development using Agile methodology and good exposure to Agile processes such as TDD (Test-Driven Development) and Scrum iterations.
• Strong knowledge of web-based architecture; strong hands-on technical debugging and troubleshooting experience with distributed enterprise applications; knowledge of the full software development life cycle (SDLC).
• Received an appreciation certificate from the Client Director for data reusability and effectiveness.
• Worked in several areas of data warehousing, including business analysis, requirement gathering, design, development, testing, and implementation.
• Fully conversant with all aspects of systems analysis, design, testing, and the entire SDLC.
• Optimization of queries in a Teradata database environment.
• Conversant with Teradata utilities like BTEQ, FastLoad, MultiLoad, FastExport, and TPump.
• Developed a UNIX shell script with BTEQ to dynamically generate a SQL script and run it in order to DROP the oldest range on Partitioned Primary Index tables, using derived tables and parsing the data dictionary view DBC.IndexConstraints.
• Completed upgrading the entire Guardian environment from Teradata V6.2 to Teradata 12 in the spring of 2009.
• Developed a UNIX shell script with BTEQ to dynamically generate a SQL script and run it in order to COLLECT STATISTICS USING SAMPLE, parsing the data dictionary views DBC.ColumnStatistics and DBC.IndexStatistics (a sketch of this generate-then-run pattern follows this list).
• Worked with data modelers, ETL staff, BI developers, and Business System Analysts in business/functional requirements review and solution design.
• Worked on data modeling with Erwin and with Visio 2010.
• Worked on performance tuning of SQL queries as part of performance improvement.
• Process-oriented, focused on standardization, streamlining, and implementation of best practices.
• Design, implementation, and administration of a robust backup plan and recovery techniques.
• Implemented Dual Active systems for mission-critical applications and ensured availability of the system for the same.
• Excellent documentation and process management skills, with an ability to effectively understand the business requirements to develop a quality product.
• Worked as a Lead/Project Manager, with the expertise to monitor and work across the various delivery approaches.
• Skilled communicator, thorough in explaining complex IT knowledge to subordinates, the management team, and the functional team.
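The two BTEQ items above follow a generate-then-run pattern; the shell sketch below shows that pattern for the COLLECT STATISTICS USING SAMPLE case. This is a minimal illustration, not the original script: the $TDLOGON variable, the target database name, and the dictionary view's column names are assumptions.

    #!/bin/sh
    # Sketch: have BTEQ emit COLLECT STATISTICS statements generated from the
    # data dictionary, then run the generated file in the same session.
    bteq <<EOF
    .LOGON ${TDLOGON}
    .SET WIDTH 500
    .EXPORT REPORT FILE = gen_stats.sql
    SELECT 'COLLECT STATISTICS USING SAMPLE ON ' ||
           TRIM(DatabaseName) || '.' || TRIM(TableName) ||
           ' COLUMN (' || TRIM(ColumnName) || ');' (TITLE '')
    FROM DBC.ColumnStatistics            /* view named in the bullet above */
    WHERE DatabaseName = 'GUARDIAN_DB';  /* hypothetical database */
    .EXPORT RESET
    .RUN FILE = gen_stats.sql
    .LOGOFF
    EOF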
Technical Skills
RDBMS: Teradata V2R5/V12, SQL Server, Oracle
Hadoop Ecosystem: HDFS, Hadoop MapReduce, HBase, Hive, Pig, Sqoop, Flume, ZooKeeper, Cloudera CDH-4, Kerberos, JSON, and YARN
ETL Tools: DataStage
RDBMS Utilities: BTEQ, FastLoad, MultiLoad, TPump, FastExport, and Query Manager
Prog. Languages: Core Java, SQL, Teradata (BTEQ, MLOAD, FASTLOAD, FAST EXPORT, TPUMP, TPT), Mainframe JCL
Operating Systems: Linux, Unix, Windows family
Specialized Tools: Amazon AWS, PuTTY, SoapUI (SOAP & RESTful services), Tortoise SVN, WAT, Puppet, Micro Focus Rumba, IBM Data Studio, Abend-Aid, File-Aid, pgAdmin III, WinSCP, Trac, Quality Center, CA7
Protocol Knowledge: TCP/IP
Technical Training & Certification:
Teradata Certified Developer
Certified Hadoop Developer
Successfully completed the following training programs and certifications:
o HBase, YARN, Sqoop, Hive, Pig
o PostgreSQL (2013, TCS)
o DataStage (2012, TCS)
o Mainframes/COBOL (2010, TCS)
o Banking Concepts (2010, TCS)
o Teradata (2012, TCS)
o Java, JCL (2009, TCS)
Professional Experience
Bank of America, Charlotte, NC Sep 2015 – Present
Hadoop Developer/ Tech Lead
Quantitative Risk Technology
The project converts the existing Oracle code to Hadoop. All processing in HDFS is done through Impala, Hive, Sqoop, HBase, MapReduce, Autosys, and Spark programs; the role also covers performance tuning and regular backups.
• Implemented Hive tables and HQL queries for the reports.
• Developed Impala queries to analyze reducer output data.
• Developed MapReduce programs to parse the raw data, filter records by ID for faster processing, and store the refined data in partitioned tables.
• Involved in troubleshooting issues and errors reported by the cluster monitoring software provided by Cloudera Manager.
• Created INSERT OVERWRITE queries with dynamic partitioning to store the data (see the sketch after this list).
• Set task status to display debug information and show the status of the MapReduce job on the JobTracker web page.
• Used Oozie to automate data loading into the Hadoop Distributed File System and Hive to pre-process the data on a daily catch-up run basis.
• The pre-processed data in Avro was used as input to the MapReduce program; Avro was used for the multi-output process, and MapReduce works well with Avro.
• Configured the cluster to periodically archive the log files for debugging, reduce the processing load on the cluster, and tuned the cluster for better performance.
• Involved in extracting and loading data from RDBMS to Hive using Sqoop.
• Involved in writing the Oozie workflow for the data and the MapReduce code to run; used fork and join where parallel processing was possible.
• Coded a shell wrapper that helps in triggering jobs from the UI.
• Involved in design decisions for performing the Hadoop transformation.
• Worked on writing HQL for Sqooping data from Oracle to Hadoop; wrote an Oozie workflow for moving data to stage and then to live.
• Worked on Sqoop import and export of data.
• Tested raw data, executed performance scripts, and shared responsibility for administration of Hadoop, Hive, and Pig.
• Involved in the design of the unstructured JSON data format and in building the required SerDes for the web services.
• Increased the performance of the Hadoop cluster by using hashing and salting methodologies for load balancing.
• Optimized HBase service data retrieval calls native to the region and improved range-based scans.
• Highly involved in designing the next-generation data architecture for unstructured data.
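The Sqoop import and dynamic-partition bullets above combine into a single flow; below is a minimal shell sketch of it. The Oracle connection string, credentials, and the table, column, and database names are hypothetical placeholders, not the project's actual objects.

    #!/bin/sh
    # Sketch: import an Oracle table into a Hive staging table with Sqoop,
    # then rewrite it into a partitioned live table with dynamic partitioning.
    set -e

    sqoop import \
      --connect jdbc:oracle:thin:@//oradb.example.com:1521/RISKDB \
      --username "$ORA_USER" --password-file /user/etl/.ora_pwd \
      --table TRADES \
      --hive-import --hive-table stage_db.trades \
      --num-mappers 4

    hive -e "
      SET hive.exec.dynamic.partition=true;
      SET hive.exec.dynamic.partition.mode=nonstrict;
      INSERT OVERWRITE TABLE live_db.trades PARTITION (load_dt)
      SELECT trade_id, book, notional, load_dt
      FROM stage_db.trades;
    "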
Languages and DB: Hive, Impala, Spark, Oracle, and HBase
Software and Tools: HDFS, Hadoop MapReduce, Hive, Pig, Sqoop, Oozie, Cloudera CDH-4, HUE, Flume, Impala, Micro Focus Rumba, SOAP services, Mule ESB, Jenkins, SVN, pgAdmin III, IBM Data Studio, Teradata, and Oracle
State Farm Insurance, Bloomington, IL Oct 2013 – Sep 2015
Hadoop Developer/ Development Lead
ICP – CDE – DC8 – MFI Base and Enhancements (Oct 2013 – Sep 2015)
This project transforms the existing Billing and Payments application to its future state by storing and processing the data entirely in HDFS. All processing in HDFS is done through Pig, Hive, Sqoop, HBase, and MapReduce programs, along with performance tuning and regular backups. MFI also involves migrating the State Farm payment plan information to the ICP platform to improve user interface response.
• Worked as a Project Manager/Project Lead, mentored all peers on the system, and imparted knowledge on the same.
• Understood business needs, analyzed functional specifications, and mapped them to Mule flows and web services of the existing applications to insert/update/retrieve data from NoSQL HBase.
• Installed and managed a 4-node, 4.8 TB Hadoop cluster for the SOW and eventually configured a 12-node, 36 TB cluster for the production and implementation environments.
• Implemented Hive tables and HQL queries for the reports.
• Created web services that interact with the HBase client API, using get/put methods for different applications (the sketch after this list shows get/put calls against a salted key).
• Used the JSON data type in Hive; developed Hive queries to analyze reducer output data.
• Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables.
• Involved in troubleshooting issues and errors reported by the cluster monitoring software provided by Cloudera Manager.
• Created simple rule-based optimizations, such as pruning non-referenced columns from table scans.
• Set task status to display debug information and show the status of the MapReduce job on the JobTracker web page.
• Used Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data on a daily catch-up run basis.
• Configured the cluster to periodically archive the log files for debugging, reduce the processing load on the cluster, and tuned the cluster for better performance.
• Involved in extracting and loading data from RDBMS to Hive using Sqoop.
• Tested raw data, executed performance scripts, and shared responsibility for administration of Hadoop, Hive, and Pig.
• Involved in the design of the unstructured JSON data format and in building the required SerDes for the web services.
• Increased the performance of the Hadoop cluster by using hashing and salting methodologies for load balancing.
• Optimized HBase service data retrieval calls native to the region and improved range-based scans.
• Highly involved in designing the next-generation data architecture for unstructured data.
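The salting and get/put bullets above can be illustrated with a small shell sketch driven through the HBase shell. The table name, column family, natural key, and the 16-bucket salt are assumptions for illustration, not the production design.

    #!/bin/sh
    # Sketch: derive a salt bucket from the natural key and prefix it to the
    # HBase row key so that sequential keys spread across regions.
    ACCOUNT_ID="000123456"   # hypothetical natural key
    SALT=$(printf '%s' "$ACCOUNT_ID" | cksum | awk '{print $1 % 16}')

    hbase shell <<EOF
    put 'payments', '${SALT}-${ACCOUNT_ID}', 'info:status', 'BILLED'
    get 'payments', '${SALT}-${ACCOUNT_ID}'
    EOF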
Languages and DB: DB2, PostgreSQL, MySQL, Expression, and HBase
Software and Tools: HDFS, Hadoop MapReduce, Hive, Pig, Sqoop, Oozie, Cloudera CDH-4, HUE, Flume, Impala, Micro Focus Rumba, SOAP services, Mule ESB, Jenkins, SVN, pgAdmin III, IBM Data Studio, and Teradata
ICP – CDE – DC8 – Checkout (Oct 2012 – Oct 2013)
This project is focused on enhancing the customer experience across all products for the billing,
payment and disbursement processes.
• Worked on analyzing the Hadoop cluster and different big data analytic tools, including Pig, HBase, and Sqoop.
• Responsible for building scalable distributed data solutions using Hadoop.
• Involved in loading data from the Linux file system to HDFS.
• Worked on installing the cluster, commissioning and decommissioning of data nodes, name node recovery, capacity planning, and slots configuration.
• Created HBase tables to store variable data coming from different portfolios (see the sketch below).
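As a rough illustration of the HDFS-loading and HBase-table bullets above, here is a minimal shell sketch; all paths, table names, and column families are hypothetical.

    #!/bin/sh
    # Sketch: stage raw feed files from the local Linux file system into HDFS,
    # then create an HBase table with one column family per data grouping.
    hdfs dfs -mkdir -p /data/checkout/incoming
    hdfs dfs -put /var/feeds/checkout/*.dat /data/checkout/incoming/

    hbase shell <<EOF
    create 'portfolio_data', 'billing', 'payment'
    EOF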
Bank of America, Bangalore, India Nov 2008 – Sep 2012
Teradata Developer
Technical Environment: Teradata V2R12, UNIX Shell Scripting, Teradata SQL Assistant, TDWM, BTEQ, COBOL, JCL
Description: The main objective of the system is to extract data from different legacy systems and load it into a mart. A business intelligence system was developed using DataStage to quickly identify customer needs and develop better-targeted services. The database is Teradata, with Mainframes; a large amount of customer-related data from diverse sources was consolidated, including customer billing, ordering, support, and service usage. The idea is to build a Decision Support System for executives.
Key Responsibilities & Achievements:
• Performed a major role in understanding the business requirements and in designing and loading data into the data warehouse (ETL).
• Worked with utilities like BTEQ, MLOAD, FLOAD, etc.
• Collected data source information from all the legacy systems and existing data stores.
• Imported various application sources and created targets and transformations using DataStage Designer (Source Analyzer, Warehouse Developer, Transformation Developer, and Mapping Designer).
• Worked on data modeling using Visio and Erwin.
• Involved in data extraction, transformation, and loading from source systems to the ODS.
• Developed complex mappings using multiple sources and targets in different databases.
• Actively participated in the performance tuning of DataStage mappings.
• Worked with the production support team on solving production issues.
• Knowledge of DataStage and Informatica.
• Security administration, including creating and maintaining user accounts, passwords, profiles, roles, and access rights.
• Tuned various queries by COLLECTING STATISTICS on columns in the WHERE and JOIN expressions.
• Knowledge of Teradata architecture.
• Knowledge of Star Schema and Snowflake Schema.
• Developed BTEQ scripts to load the data from the staging tables to the base tables (a sketch follows this list).
• Created BTEQ scripts to extract data from the warehouse for downstream systems.
• Performed unit-level testing as part of development; also assisted the testing team in running SIT/UAT/pre-production testing.
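A minimal BTEQ sketch of the staging-to-base load pattern named above; the database, table, and column names are invented for illustration.

    #!/bin/sh
    # Sketch: load a base table from its staging table inside one BTEQ session,
    # abort with a non-zero return code on error, then refresh statistics.
    bteq <<EOF
    .LOGON ${TDLOGON}
    INSERT INTO MART_DB.CUSTOMER_BILLING
    SELECT cust_id, bill_cycle, amount, load_dt
    FROM STG_DB.CUSTOMER_BILLING_STG;
    .IF ERRORCODE <> 0 THEN .QUIT 8
    COLLECT STATISTICS ON MART_DB.CUSTOMER_BILLING COLUMN (cust_id);
    .LOGOFF
    EOF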
IGCAR (Indira Gandhi Centre for Atomic Research), Kalpakkam, Tamil Nadu June 2007 – Nov 2008 (Internship)
Modelling using Rhapsody:
Worked as an intern and helped in creating a working model for a system.
Worked on creating a logical diagram for a complex system for the plant.
Analysis and design of integration for an instrumentation project using IBM Rational Rhapsody.
Whereas Spectra CX is built on IBM Rational Software Architect (RSA), which provides a rich UML modelling capability, this project integrates Rhapsody as the front end for UML modelling in the Spectra CX tool.
This entails updates to IGCAR's instrumentation product, using the Rhapsody API to generate Rhapsody profiles in addition to RSA profiles, to extend UML for a particular embedded software domain, especially one based on the Software Communications Architecture (SCA) for software-defined radios.
Also, plug-in extensions in the Rhapsody tooling push the user's model for validation and code generation.