CLOUD-BASED BIG DATA
ANALYTICS
INTRODUCTION:
• With the advent of the digital age, the amount of data being
generated, stored and shared has been rising steadily. Data
warehouses, social media, webpages, blogs and audio/video
streams are all sources of massive amounts of data.
• This data has huge potential, but it also brings ever-increasing
complexity, security risks, and a large share of irrelevant content.
• Big data, by definition, is a term used to
describe a variety of data (structured, semi-structured
and unstructured), which makes for a
complex data infrastructure.
• Big data is characterized by variety, volume, velocity
and veracity.
• The different types of data in a dataset
determine Variety, while the rate at which data is
produced determines Velocity.
• The size of the data is called Volume.
• Veracity indicates data reliability.
• The cloud computing environment offers
development, installation and
implementation of software and data
applications ‘as a service’. The three service models are:
• Software as a service (SaaS)
• Platform as a service (PaaS)
• Infrastructure as a service (IaaS)
• Infrastructure as a service is a model that
provides computing and storage resources as
a service.
• In the case of PaaS and SaaS, the cloud service
provides a software platform or the software itself.
LITERATURE SURVEY:
• Traditional data management tools and data processing or data
mining techniques cannot be used for Big Data Analytics because of the
large volume and complexity of the datasets involved.
• Conventional business intelligence applications rely on
traditional analytics methods and
techniques such as OLAP, BPM, data mining and database
systems like RDBMS.
• One of the most popular models for data processing on
a cluster of computers is MapReduce.
• Hadoop is an open-source framework that implements the
MapReduce model and pairs it with a
distributed file system (HDFS).
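The MapReduce model mentioned above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example, and a real framework would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a small in-memory input."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data on the cloud", "big data analytics"]))
```

The point of the split is that map_phase only looks at one record and reduce_phase only looks at one key, so both can be distributed over many machines without coordination beyond the shuffle.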
PROBLEM STATEMENT:
• In order to move beyond the existing techniques and strategies
used for machine learning and data analytics, some challenges
need to be overcome. NESSI identifies the following
requirements as critical.
• In order to select an adequate method or design, a solid scientific
foundation needs to be developed.
• New efficient and scalable algorithms need to be developed.
• For proper implementation of devised solutions, appropriate
development skills and technological platforms must be identified and
developed.
• Lastly, the business value of the solutions must be explored just as
much as the data structure and its usability.
• This section describes two example applications in which large-scale
data management over the cloud is used. These are specific
use-case examples in telecom and finance.
• In the telecom domain, massive amounts of call detail records
can be processed to generate near real-time network usage
information.
• In the finance domain, the example is a fraud detection
application.
DESIGN, IMPLEMENTATION AND RESULT
ANALYSIS DETAILS:
1. Dashboard for CDR Processing:
• Telecom operators are interested in building a dashboard that would
allow analysts and architects to understand the traffic flowing
through the network along various dimensions of interest.
• The traffic is captured using Call Detail Records (CDRs), whose volume
runs into a terabyte per day.
• A CDR is a structured stream generated by telecom switches to
summarize various aspects of individual services such as voice, SMS and
MMS.
• Dashboard tasks include determining the cell site used most by each
customer, identifying whether users mostly make calls within a single cell
site, and, for cell sites in rural areas, identifying the source of traffic,
i.e., local versus routed calls.
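Two of the dashboard questions above can be sketched as simple aggregations. This is an illustration only: the record fields (customer_id, cell_site, dest_cell_site) are assumed for the example and are not taken from any real CDR format, and a production system would compute these over a distributed store rather than an in-memory list.

```python
from collections import Counter, defaultdict


def most_used_cell_site(cdrs):
    """Return {customer_id: cell_site} for the site each customer used most."""
    usage = defaultdict(Counter)
    for cdr in cdrs:
        usage[cdr["customer_id"]][cdr["cell_site"]] += 1
    return {cust: sites.most_common(1)[0][0] for cust, sites in usage.items()}


def local_call_share(cdrs):
    """Per originating site, the fraction of calls that end at the same site."""
    total = Counter()
    local = Counter()
    for cdr in cdrs:
        total[cdr["cell_site"]] += 1
        if cdr["cell_site"] == cdr["dest_cell_site"]:
            local[cdr["cell_site"]] += 1
    return {site: local[site] / total[site] for site in total}


# Hypothetical sample records, for illustration only.
SAMPLE_CDRS = [
    {"customer_id": "c1", "cell_site": "A", "dest_cell_site": "A"},
    {"customer_id": "c1", "cell_site": "A", "dest_cell_site": "B"},
    {"customer_id": "c1", "cell_site": "B", "dest_cell_site": "B"},
    {"customer_id": "c2", "cell_site": "B", "dest_cell_site": "A"},
]

if __name__ == "__main__":
    print(most_used_cell_site(SAMPLE_CDRS))
    print(local_call_share(SAMPLE_CDRS))
```

A low local share at a rural site would suggest that most of its traffic is routed rather than local.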
• Given the huge and ever-growing customer base and large call volumes,
solutions using a traditional warehouse will not be able to keep up with
the rates required for effective operation.
• The need is to process the CDRs in near real-time, mediate them (i.e.,
collect CDRs from individual switches, stitch, validate, filter, and
normalize them), and create various indices that can be exploited by
the dashboard, among other applications.
• A system based on IBM Stream Processing Language (SPL) can
mediate 6 billion CDRs per day, which averages out to roughly 69,000
records per second.
• CDRs can be loaded periodically into a cloud data management solution.
Because the cloud provides flexible storage, one can decide
on the storage required depending on traffic.
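The mediation steps listed above (collect, stitch, validate, filter, normalize) can be sketched as a small pipeline. This is plain Python, not SPL, and everything specific in it is an assumption made for the example: the field names (call_id, caller, service, duration_s), the rule that partial records of one call are stitched by summing durations, and the order in which the steps run.

```python
from itertools import chain

# Hypothetical required fields for a usable record.
REQUIRED = ("call_id", "caller", "service", "duration_s")


def collect(switch_feeds):
    """Merge the record streams from individual switches into one stream."""
    return chain.from_iterable(switch_feeds)


def validate(records):
    """Keep only records with all required fields and a non-negative duration."""
    for r in records:
        if all(k in r for k in REQUIRED) and r["duration_s"] >= 0:
            yield r


def normalize(records):
    """Use one convention: digits-only caller numbers, upper-case service names."""
    for r in records:
        yield {
            **r,
            "caller": "".join(ch for ch in r["caller"] if ch.isdigit()),
            "service": r["service"].strip().upper(),
        }


def stitch(records):
    """Combine partial records of the same call by summing their durations."""
    calls = {}
    for r in records:
        if r["call_id"] in calls:
            calls[r["call_id"]]["duration_s"] += r["duration_s"]
        else:
            calls[r["call_id"]] = dict(r)
    return list(calls.values())


def mediate(switch_feeds, services=("VOICE", "SMS", "MMS")):
    """Run the full pipeline and return one cleaned record per call."""
    cleaned = normalize(validate(collect(switch_feeds)))
    wanted = (r for r in cleaned if r["service"] in services)  # filter step
    return stitch(wanted)
```

The first three stages are generators, so records flow through one at a time; only stitch holds state, which is where a real stream processor would need windowing.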
2. Credit Card Fraud Detection:
• More than one-tenth of the world’s population shops online. Credit
cards are the most popular mode of online payment. As the number of
credit card transactions rises, the opportunities for attackers to steal
credit card details and commit fraud also increase.
• Because the attacker only needs to know some details about the card (card
number, expiration date, etc.), the only way to detect online credit card
fraud is to analyze spending patterns and detect any inconsistency
with the card holder’s usual spending patterns.
• Companies keep tabs on the geographical locations where credit
card transactions are made. If the area is far from the card holder’s area
of residence, or if two transactions from the same credit card are made
in two very distant areas within a relatively short timeframe, then the
transactions are potentially fraudulent.
• Various data mining algorithms are used to detect patterns within the
transaction data. Detecting these patterns requires the analysis of large
amounts of data.
• Using tuples of the transactions, one can find the distance between the
geographic locations of two consecutive transactions, the amounts of these
transactions, etc. From these parameters, one can identify potentially
fraudulent transactions. Further data mining based on a particular
user’s spending profile can be used to increase confidence that
a transaction is indeed fraudulent.
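The distance check between consecutive transactions can be sketched as follows. The great-circle distance uses the standard haversine formula; the tuple layout (timestamp_s, lat, lon, amount) and the 900 km/h speed threshold are assumptions made for this example, not values from the source.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def flag_suspicious(transactions, max_speed_kmh=900.0):
    """Return indices of transactions too far from the previous one.

    transactions: list of (timestamp_s, lat, lon, amount) tuples for one
    card, sorted by time. The threshold is an assumed value for the sketch.
    """
    flagged = []
    for i in range(1, len(transactions)):
        t0, lat0, lon0, _ = transactions[i - 1]
        t1, lat1, lon1, _ = transactions[i]
        hours = (t1 - t0) / 3600.0
        dist = haversine_km(lat0, lon0, lat1, lon1)
        if hours <= 0:
            # Same instant at two different places is always suspicious.
            if dist > 0:
                flagged.append(i)
        elif dist / hours > max_speed_kmh:
            flagged.append(i)
    return flagged
```

A flag from this check is only a candidate; as the slide notes, the user's spending profile is then used to raise or lower confidence.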
• As the number of credit card transactions is huge and the kind of processing
required is not typical relational processing (hence, warehouses are not
optimized for it), one can use a Hadoop-based solution
for this purpose, as depicted.
• Using Hadoop, one can create customer profiles as well as
matrices of consecutive transactions to decide whether a particular
transaction is fraudulent. As the fraud must be found within
some specified time, stream processing can help.
• By employing massive resources to analyze potentially fraudulent
transactions, one can meet the response time guarantees.
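A customer profile of the kind described above could be as simple as the mean and spread of past amounts. The sketch below is one possible profile, not the method used in the source: the function names, the z-score style measure and the threshold of 3.0 are all assumptions. In a Hadoop job, build_profile would typically run once per customer key; here it is plain Python.

```python
from statistics import mean, stdev


def build_profile(amounts):
    """Summarise a customer's past transaction amounts (needs at least two)."""
    return {"mean": mean(amounts), "stdev": stdev(amounts)}


def amount_score(profile, amount):
    """Distance of an amount from the customer's mean, in standard deviations."""
    if profile["stdev"] == 0:
        return 0.0 if amount == profile["mean"] else float("inf")
    return abs(amount - profile["mean"]) / profile["stdev"]


def is_unusual(profile, amount, threshold=3.0):
    """Flag amounts more than `threshold` deviations from the usual spend."""
    return amount_score(profile, amount) > threshold


if __name__ == "__main__":
    history = [20, 25, 30, 22, 28]  # hypothetical past amounts
    profile = build_profile(history)
    print(is_unusual(profile, 27), is_unusual(profile, 400))
```

Combining this score with the location check on consecutive transactions gives two independent signals before a transaction is treated as fraudulent.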
3. Result Analysis:
• Several open-source data mining techniques, resources
and tools exist. These include R, GATE, RapidMiner
and Weka, in addition to many others.
• Cloud-based big data analytics solutions should make
these affordable data analytics tools available
on the cloud so that cost-effective and
efficient services can be provided.
• The fundamental reasons why cloud-based analytics are
so attractive are their easy accessibility,
cost-effectiveness and ease of setting up and testing.
CONCLUSION AND FUTURE RESEARCH
DIRECTION:
• This is an age of big data and the emergence of this field of
study has attracted the attention of many practitioners and
researchers.
• Considering the rate at which data is being created in the
digital world, big data analytics has become all
the more relevant.
• Cloud infrastructure meets the storage and computing
requirements of data analytics algorithms. On the other hand,
open issues such as security, privacy and the lack of ownership and
control remain.
• Research studies in the area of cloud-based big data analytics
THANK YOU
