Report for internship
Internship Report
Ghulam Ishaq Khan Institute of Engineering
Sciences and Technology
Name: Salman Khan
Registration Number: 2012338
Organization: Teradata
Duration: 1 Month (Four Weeks)
Submission Date: 30th November 2015
Faculty of Computer Science and Engineering (Fall 2016)
Acknowledgement:
First, I would like to thank Sir Hassan Waqar, Awais Ijaz, Professional
Services Consultant, for giving me the opportunity to do an internship
within the organization. For me it was a unique experience to be at
Teradata Pakistan and to study an interesting field, data warehousing. It
also helped me regain my interest in databases and make new plans for my
future career.
I would also like to thank all the people who worked in the Teradata office
in Lahore. With their patience and openness they created an enjoyable
working environment.
Furthermore, I want to thank all the students with whom I did the fieldwork.
We experienced great things together, and they showed me their final
year projects.
Lastly, I would like to thank all the administration staff of Ghulam Ishaq
Khan Institute of Engineering Sciences and Technology and the faculty
members of the Computer Science department, especially Sir Fawad.
EXECUTIVE SUMMARY:
This report was prepared for my internship program. It presents a brief
study of the operations, functions, and tasks I performed during the
internship.
Teradata plays a leading role in providing powerful enterprise big data
analytics and services, including Data Warehousing, Data Driven
Marketing, BI and CRM.
In preparing this report I have tried my best to provide all possible
information about the operations, functions, tasks and corporate
information of Teradata Pakistan in a brief yet comprehensive form.
About Teradata:
Introduction:
Teradata Corporation is a publicly held international computer company
that sells analytic data platforms, marketing applications and related
services. Its analytics products are meant to consolidate data from different
sources and make the data available for analysis. Teradata marketing
applications are meant to support marketing teams that use data analytics
to inform and develop programs.
Teradata is an enterprise software company that develops and sells
a relational database management system (RDBMS) with the same name.
Teradata is publicly traded on the New York Stock Exchange (NYSE) under
the stock symbol TDC.
Teradata Products:
The Teradata product is referred to as a "data warehouse system" and
stores and manages data. The data warehouses use a "shared nothing"
architecture, which means that each server node has its own memory and
processing power. Adding more servers and nodes increases the amount
of data that can be stored. The database software sits on top of the servers
and spreads the workload among them. Teradata sells applications and
software to process different types of data. In 2010, Teradata added text
analytics to track unstructured data, such as word processor documents,
and semi-structured data, such as spreadsheets.
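The idea of spreading rows across independent nodes can be illustrated with a small C++ sketch. This is only an illustration under stated assumptions: the helper names nodeForKey and distribute are hypothetical, and std::hash is a stand-in, not Teradata's actual hashing algorithm.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: pick the node that owns a row by hashing its key.
// std::hash is a stand-in here; it is NOT Teradata's real hash function.
std::size_t nodeForKey(const std::string& key, std::size_t nodeCount) {
    return std::hash<std::string>{}(key) % nodeCount;
}

// Count how many of the keys "row0" .. "row<N-1>" land on each node.
std::vector<std::size_t> distribute(std::size_t rowCount, std::size_t nodeCount) {
    std::vector<std::size_t> rowsPerNode(nodeCount, 0);
    for (std::size_t i = 0; i < rowCount; ++i) {
        ++rowsPerNode[nodeForKey("row" + std::to_string(i), nodeCount)];
    }
    return rowsPerNode;
}
```

Because the owning node is computed from the key alone, every node can locate its own rows without consulting the others, which is the property a "shared nothing" design relies on. How evenly the rows spread depends on the hash function and the keys.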
Teradata's product can be used for business analysis. Data warehouses
can track company data, such as sales, customer preferences, product
placement, etc.
Teradata Database:
Teradata is a relational database management system (RDBMS) with the
following characteristics:
• Teradata is an open system, running on a UNIX MP-RAS or Windows
server platform.
• Teradata is capable of supporting many concurrent users from various
client platforms.
• Teradata is compatible with industry standards (ANSI compliant).
• Teradata is completely built on a parallel architecture.
Why Teradata?
There are plenty of reasons why customers choose Teradata:
• Teradata supports larger warehouse data than all competitors
combined.
• Teradata Database can scale from 100 gigabytes to over 100
petabytes of data on a single system without losing any performance.
This is called scalability.
• Provides a parallel-aware Optimizer that makes query tuning
unnecessary to get a query to run.
• Automatic and even data distribution eliminates complex indexing
schemes or time-consuming reorganizations.
• Teradata Database can handle the most concurrent users, who are
often running multiple, complex queries.
• Designed and built with parallelism.
• Supports ad-hoc queries using SQL.
• Single point of control for the DBA (Teradata Manager).
• Unconditional parallelism (parallel architecture).
• Teradata provides the lowest total cost of ownership (TCO).
• High availability of data because there is no single point of failure;
fault tolerance is built into the system.
Teradata Database can be used for:
• Enterprise data warehousing
• Active data warehousing
• Customer relationship management
• Internet and E-Business
• Data marts
OBJECTIVE OR PURPOSE OF INTERNSHIP:
The study has two main purposes, as follows.
1: General Purpose / Objective
• To learn how people work in an organization.
• To gain work experience at Teradata, which will help me in the job
process.
• To know what skills they want from an employee.
• To see how our professional studies are applied in practice.
2: Specific Purpose / Objective
The specific purposes of the study include:
• To learn how employees in a large organization handle a problem.
• To get a certificate from Teradata.
• To use their database software and test queries on it.
• To objectively observe the operations of Teradata in general.
Interview Questions:
• Tell me about yourself.
• What can you do for us that other candidates can't?
• What is parallelism in Teradata?
• Can we load a multiset table using MLOAD?
• What is the use of BI in Teradata?
• What is a snowflake schema in a database?
• What is a star schema?
• Why is normalization necessary?
• Why is de-normalization necessary?
• What are views in a database?
Description of the internship:
This report is a short description of my four-week internship, carried out
as a compulsory component of the BS in Computer Science. The internship
was carried out at Teradata in summer 2015. As I am interested in
databases, the work concentrated on data warehousing.
At the beginning of the internship I formulated several learning goals that
I wanted to achieve:
• to understand the functioning and working conditions of a
non-governmental organization;
• to see what it is like to work in a professional environment;
• to see if this kind of work is a possibility for my future career;
• to use my gained skills and knowledge;
• to see what skills and knowledge I still need to work in a professional
environment;
• to learn about the organizing of a research project (planning,
preparation, permissions etc.);
• to learn about research methodologies (field methods/methods to
analyze data);
• to get fieldwork experience/collect data in an environment unknown
to me;
• to enhance my communication skills;
• to build a network.
This internship report describes the activities that contributed to
achieving a number of my stated goals.
1st Week:
During the first week I revised basic database concepts and practiced
writing complex queries.
This task was given to me as homework, while in the office I was given
a training session on using the software named Tableau.
Tableau Software:
Tableau Software is an American computer software company
headquartered in Seattle, Washington. It produces a family of
interactive data visualization products focused on business intelligence.
Products:
Tableau offers five main products: Tableau Desktop, Tableau Server,
Tableau Online, Tableau Reader and Tableau Public. Tableau Public and
Tableau Reader are free to use, while both Tableau Server and Tableau
Desktop come with a 14-day fully functional free trial period, after which the
user must pay for the software. Tableau Desktop comes in both a
Professional and a lower cost Personal edition. Tableau Online is available
with an annual subscription for a single user, and scales to support
thousands of users.
2nd Week:
The picture below shows my assignment no. 1.
The list below was sent to me by Miss Maria; it contains the names of
different companies.
Conclusion:
This was my 2nd week task, which I did with full dedication and hard work.
3rd Week:
In the 3rd and 4th weeks I was given the task of creating a data warehouse.
In the 3rd week I created a schema diagram for normalized data and then
created the tables. After creating the database, it was time to populate
it with data, up to 500,000 rows per table in the normalized database.
Here is the schema diagram for normalized data.
Fact Tables:
A fact table is the central table in a star schema of a data warehouse. A
fact table stores quantitative information for analysis and is often
denormalized.
Dimension Tables:
Contrary to fact tables, dimension tables contain descriptive attributes (or
fields) that are typically textual fields (or discrete numbers that behave like
text). These attributes are designed to serve two critical purposes: query
constraining and/or filtering, and query result set labeling.
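As a rough illustration of how the two kinds of tables work together, here is a minimal C++ sketch. The ProductDim and SalesFact structures and the totalByCategory function are hypothetical examples, not the tables from my schema: the fact rows carry the numbers, and the dimension rows supply the label used to group them.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical dimension row: descriptive attributes for filtering/labeling.
struct ProductDim {
    int productId;
    std::string name;
    std::string category;
};

// Hypothetical fact row: numeric measures plus a key into the dimension.
struct SalesFact {
    int productId;
    int quantity;
    double amount;
};

// Sum the sales amount per category by looking up each fact's dimension row.
std::map<std::string, double> totalByCategory(
        const std::vector<ProductDim>& products,
        const std::vector<SalesFact>& sales) {
    // Index the dimension by its key so each fact can find its label.
    std::map<int, std::string> categoryOf;
    for (const ProductDim& p : products) {
        categoryOf[p.productId] = p.category;
    }

    std::map<std::string, double> totals;
    for (const SalesFact& s : sales) {
        auto it = categoryOf.find(s.productId);
        if (it != categoryOf.end()) {
            totals[it->second] += s.amount;  // label comes from the dimension
        }
    }
    return totals;
}
```

In a real warehouse the same grouping would be written as an SQL join between the fact table and the dimension table; the sketch only mirrors that logic in memory.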
Code:
The following code generates 500,000 rows of random data and stores them
in a text file, which is then loaded into the database tables.
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
using namespace std;

// Characters used for the random alphanumeric column.
static const char alphanum[] =
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz";
const int stringLength = sizeof(alphanum) - 1;  // exclude the trailing '\0'

// Return one random character from alphanum.
char genRandom()
{
    return alphanum[rand() % stringLength];
}

int main()
{
    srand(static_cast<unsigned>(time(NULL)));  // seed the generator once
    ofstream myfile("Name.txt");
    if (!myfile) {
        cerr << "Could not open Name.txt" << endl;
        return 1;
    }
    for (int i = 0; i < 500000; i++) {
        int x = rand() % 8 + 4;  // name length between 4 and 11
        myfile << i << " | ";    // column 1: row id
        for (int j = 0; j < x; j++) {
            // column 2: random upper-case letters
            myfile << static_cast<char>('A' + rand() % 26);
        }
        myfile << " | ";
        for (int z = 0; z < 21; z++) {  // column 3: 21 alphanumeric characters
            myfile << genRandom();
        }
        myfile << "\n";  // end of row
    }
    myfile.close();
    return 0;
}
4th Week:
In the last week, the task was to denormalize the above database, build a
warehouse, and compare query times for the normalized and
denormalized data.
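The kind of comparison involved can be sketched in C++ as an in-memory analogy. This is not the SQL I ran on Teradata: the Customer, Order and FlatOrder structures and all function names are hypothetical, and the timings it gives depend entirely on the machine and data size.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

struct Customer { int id; std::string city; };          // normalized: own table
struct Order { int customerId; double amount; };        // normalized: references Customer
struct FlatOrder { std::string city; double amount; };  // denormalized: city copied in

// Normalized query: resolve each order's city through a lookup (the "join").
double totalForCityNormalized(const std::vector<Customer>& customers,
                              const std::vector<Order>& orders,
                              const std::string& city) {
    std::unordered_map<int, const std::string*> cityOf;
    for (const Customer& c : customers) cityOf[c.id] = &c.city;
    double total = 0.0;
    for (const Order& o : orders) {
        auto it = cityOf.find(o.customerId);
        if (it != cityOf.end() && *it->second == city) total += o.amount;
    }
    return total;
}

// Denormalized query: a single scan, no lookup needed.
double totalForCityDenormalized(const std::vector<FlatOrder>& rows,
                                const std::string& city) {
    double total = 0.0;
    for (const FlatOrder& r : rows) {
        if (r.city == city) total += r.amount;
    }
    return total;
}

// Run any callable once and return the elapsed time in microseconds.
template <typename F>
long long elapsedMicros(F&& query) {
    auto start = std::chrono::steady_clock::now();
    query();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Both functions should return the same total for the same data; wrapping each call in elapsedMicros gives two numbers to compare. The denormalized form avoids the lookup step at the cost of storing the city once per order.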
Schema diagram for denormalized data.
Comparison of Normalized and Denormalized queries:
Normalized Query:
Denormalized Query: