DATA WAREHOUSING SOLUTION
USING APACHE SPARK
TEAM 18
AYUSH KHANDELWAL
GAURAV PARIDA
ANIL REDDY
MEHAK AGARWAL
INTRODUCTION TO DATA WAREHOUSE
A data warehouse is constructed by integrating data from multiple heterogeneous
sources. It supports analytical reporting, structured and/or ad hoc queries, and
decision making.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data. This data helps analysts in an organization make informed
decisions.
It is kept separate from the organization's operational database and is not updated
frequently.
It holds consolidated historical data, which helps the organization analyze its
business.
Image taken from wikipedia.org/datawarehouse
KEY FEATURES
Subject Oriented - A data warehouse is subject oriented because it provides information around a
subject rather than the organization's ongoing operations.
Integrated - A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of data.
Time Variant - The data collected in a data warehouse is identified with a particular time period.
The data in a data warehouse provides information from the historical point of view.
Non-volatile - Non-volatile means the previous data is not erased when new data is added. A
data warehouse is kept separate from the operational database, so frequent changes in the
operational database are not reflected in the data warehouse.
DATA WAREHOUSE VS OPERATIONAL DATABASE
An operational database is constructed for well-known tasks and workloads such as
searching particular records, indexing, etc. In contrast, data warehouse queries are
often complex and present a general form of data.
Operational databases support concurrent processing of multiple transactions.
Concurrency control and recovery mechanisms are required for operational
databases to ensure robustness and consistency of the database.
An operational database query allows read and modify operations, while an
OLAP query needs only read-only access to the stored data.
An operational database maintains current data. On the other hand, a data
warehouse maintains historical data.
APACHE SPARK
Open Source
Alternative to MapReduce for certain applications
A low-latency cluster computing system
For very large data sets
May be 100 times faster than MapReduce for
Iterative algorithms
Interactive data mining
Used with Hadoop / HDFS
Released under BSD License
SPARK FEATURES
Uses in-memory cluster computing
Memory access is faster than disk access
Has APIs written in
Scala
Java
Python
Can be accessed from Scala and Python shells
Currently an Apache incubator project
Scales to very large clusters
Uses in-memory processing for increased speed
Low-latency shell access
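The in-memory behaviour above can be illustrated with a short PySpark sketch (the HDFS path and app name are illustrative, not from the project): caching an RDD keeps it in cluster memory, so repeated actions skip the disk.

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-demo")  # illustrative app name

    # Mark the RDD for in-memory storage; nothing is read yet (RDDs are lazy).
    lines = sc.textFile("hdfs:///data/ratings.csv").cache()

    print(lines.count())  # first action reads from HDFS and fills the cache
    print(lines.count())  # second action is served from cluster memory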
OUR DATA WAREHOUSE SOLUTION
Building a data warehouse is a task that requires a large amount of data to start
with, combined with immense computational resources.
This project deals with creating a data warehouse-like system which can perform
basic queries and some analytics.
Use-cases that we are dealing with:
Ad-hoc queries such as “best movies of 2012”, “best comedy movies” etc.
Movie rating progression graph
Movie recommendation engine
MOVIELENS 20M DATASET
movielens.org is a movie ratings aggregator run by GroupLens.
GroupLens provides MovieLens datasets of several sizes for free; they can be found at
http://grouplens.org/datasets/movielens/
For this project, we are using the MovieLens 20M dataset, which is the largest of
the datasets provided.
Statistics about the dataset:
20 million ratings
465,000 tag applications
27,000 movies
DESCRIBING THE DATA
The dataset contains four CSV files, of which only two are used in this project:
movies.csv - movieid, title, genres
ratings.csv - userid, movieid, rating, timestamp
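As a sketch of how these rows can be turned into typed records (the sample line below is made up, not taken from the dataset; MovieLens genres are pipe-separated, and titles containing commas are quoted):

    import csv
    from io import StringIO

    def parse_movie(line):
        # movies.csv: movieid,title,genres -- the title may be quoted
        movie_id, title, genres = next(csv.reader(StringIO(line)))
        return int(movie_id), title, genres.split("|")

    def parse_rating(line):
        # ratings.csv: userid,movieid,rating,timestamp
        user_id, movie_id, rating, ts = line.split(",")
        return int(user_id), int(movie_id), float(rating), int(ts)

    # hypothetical sample row for illustration
    print(parse_movie('42,"Movie, The (2000)",Comedy|Drama'))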
SOME IDEAS FROM HIVE
A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis.
Supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as the Amazon S3 filesystem.
Provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL.
FOREGROUND
Taking ideas from Apache Hive, we propose the following solution in this
project:
Dataset files are stored in HDFS.
An API interface has been developed using Flask instead of a graphical interface.
API rules have been defined for each query.
When the API URL is hit with the appropriate parameters, the results are
displayed in the browser window.
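A minimal sketch of what such an API rule might look like in Flask (the route and the query function are hypothetical stand-ins, not the project's actual endpoints):

    from flask import Flask, jsonify

    app = Flask(__name__)

    def query_best_movies(year):
        # Hypothetical placeholder: in the project, this would run the
        # corresponding Spark query and collect its result.
        return [{"title": "...", "avg_rating": 0.0, "year": year}]

    # API rule: /best/2012 -> "best movies of 2012"
    @app.route("/best/<int:year>")
    def best_movies(year):
        return jsonify(query_best_movies(year))

    if __name__ == "__main__":
        app.run()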
BACKGROUND
The dataset files are pushed to HDFS for faster access without any modifications.
For each query, the files are read from HDFS and converted to Spark RDDs (Resilient
Distributed Datasets).
RDDs are a logical collection of data partitioned across machines. They can be
manipulated in parallel.
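RDD behaviour in miniature (a toy sketch, independent of the dataset): partitions are processed in parallel across the cluster.

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")

    rdd = sc.parallelize(range(10), 4)     # one RDD split into 4 partitions
    print(rdd.getNumPartitions())          # -> 4
    print(rdd.map(lambda x: x * x).sum())  # map runs per partition in parallel -> 285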
The API call is parsed for parameters, and accordingly the corresponding query
function is called.
The result of the query is handed over to Flask and displayed in the browser. GraphX
has been used for plotting the graph.
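Putting the pieces together, a hedged sketch of the "best movies of 2012" query (the HDFS paths, header-skipping logic, and minimum-ratings cutoff are assumptions for illustration, not taken from the project code):

    import csv
    from io import StringIO
    from pyspark import SparkContext

    sc = SparkContext(appName="best-movies")

    def fields(line):
        # Parse one CSV line; titles may be quoted and contain commas.
        return next(csv.reader(StringIO(line)))

    # (movieid, title); assumes the files keep their header rows
    movies = (sc.textFile("hdfs:///data/movies.csv")
                .filter(lambda l: not l.startswith("movieId"))
                .map(fields)
                .map(lambda f: (int(f[0]), f[1])))

    # (movieid, rating)
    ratings = (sc.textFile("hdfs:///data/ratings.csv")
                 .filter(lambda l: not l.startswith("userId"))
                 .map(fields)
                 .map(lambda f: (int(f[1]), float(f[2]))))

    # average rating per movie, ignoring movies with few votes
    avg = (ratings.mapValues(lambda r: (r, 1))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .filter(lambda kv: kv[1][1] >= 1000)   # illustrative cutoff
                  .mapValues(lambda s: s[0] / s[1]))

    # join with titles; MovieLens embeds the release year in the title
    top = (avg.join(movies)
              .filter(lambda kv: "(2012)" in kv[1][1])
              .sortBy(lambda kv: kv[1][0], ascending=False)
              .take(10))

    for movie_id, (rating, title) in top:
        print(title, round(rating, 2))

A real deployment would cache the parsed RDDs, since every API call reuses them.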
