Big data analytics (BDA) provides capabilities for revealing additional value from big data. It examines large amounts of data from various sources to deliver insights that enable real-time decisions. BDA is different from data warehousing and business intelligence systems. The complexity of big data systems required the development of specialized architectures such as Hadoop, which processes large amounts of data in a timely and low-cost manner. Big data challenges include capturing, storing, analyzing, sharing, transferring, visualizing, querying, and updating large and diverse datasets, and ensuring their privacy.
1. Big Data and Analytics
Name of the Staff: M. Florence Dayana
Head, Dept. of CA
Bon Secours College for Women
Thanjavur.
3. Introduction
• Big data analytics (BDA) is a new approach in information management which provides a set of capabilities for revealing additional value from big data (BD).
• It is defined as “the process of examining large amounts of data, from a variety of data sources and in different formats, to deliver insights that can enable decisions in real or near real time”.
• BDA is a different concept from those of Data Warehouse (DW) or Business Intelligence (BI) systems.
4. Introduction
• The complexity of BD systems required the development of a specialized architecture.
• Nowadays, the most commonly used BD architecture is Hadoop.
• It has redefined data management because it processes large amounts of data in a timely manner and at a low cost.
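The MapReduce model at the heart of Hadoop can be pictured with a small, self-contained sketch (plain Python, not the Hadoop API; the sample documents are made up for illustration): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big tools", "hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real Hadoop cluster the same three steps run in parallel across many machines, with the shuffle moving data over the network; the logic per step is no more complicated than this.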
6. Challenges with Big Data
Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources.
Big data was originally associated with three key concepts: volume, variety, and velocity.
8. Dealing with data growth
• Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years.
Generating insights in a timely manner
• The infrastructure for big data must be cost-efficient, elastic, and easy to upgrade or downgrade.
9. Recruiting and retaining big data talent
• Another challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions.
10. Integrating disparate data sources
• There is a dearth of skilled professionals who possess a high level of proficiency in data science, which is vital in implementing a big data solution.
Validating data
• The data changes are highly dynamic, and therefore there is a need to ingest this data as quickly as possible.
11. Visualization
• Data visualization is becoming popular as a separate discipline, yet the field falls short by quite a number as far as business-visualization experts are concerned.
12. How Big Data Impacts IT
• Introduction
The information technology sector terms the exponential amounts of data generated in today's interconnected world 'Big Data.'
Big data comes in many forms, from meteorological and astronomical calculations and mappings to social media networks and photo-sharing networks.
Retailers, government agencies, healthcare providers and insurers, financial institutions, and other organizations collect large amounts of data on every transaction, every doctor's office visit or purchase, to improve the functions or processes in which they are involved.
13. The Effect of Big Data on Information Technology Employment
New data and document control systems, software, and infrastructure to move, process, and store this information are being developed as we speak, as older systems become obsolete. Indeed, the amount of data we are generating is growing at an exponential rate. Some of the resulting effects include:
• An employment boom for specialists and IT professionals
• A shortage of IT workers in the US with the specific skills to handle large pools of data
• A developed need for employer-sponsored training programs
• Calls for the government to issue visas to foreign workers in the US
14. • More data-reliant companies in the marketplace as technology evolves
• New specialty job positions emerging in the healthcare IT sector
• Special higher-education programs being developed to meet future demand in Healthcare Informatics
• While the visa-issuance debate rages on, IT and Healthcare IT recruitment companies like Talascend are helping customers find the best-fit talent for customers and best-fit IT jobs for candidates in retail, financial, healthcare, software, insurance, manufacturing, and other technology markets to handle these effects.
21. 3 V’s of Big Data
• Volume: The amount of data collected from various sources, including e-business transactions (PayPal, Paytm, Airtel Money, etc.), social media (Facebook, Twitter, WhatsApp), sensors (weather monitoring, space sensors), and machine-to-machine data (networking, IoT), by millions of users around the world. To study such massive data, Hadoop provides a great tool.
• Velocity: The massive stored data needs to be handled at unprecedented speed under time constraints. In addition, devices should be connected in parallel with smart sensors and metering devices that process in real time, to keep the data transparent.
22. 3 V’s of Big Data
• Variety: Data comes in two or more formats, but mainly as structured data (numeric data in traditional databases) and unstructured data (like stock ticker data, email, financial transactions, audio, video, and text documents).
• Variability: Inconsistency of the data set, arriving at high velocity and in a variety of forms, needs to be processed without hampering the information, and the speed must be managed at peak data-processing load; for example, social media data demand increases in the morning and evening.
23. 3 V’s of Big Data
• Complexity: The data coming from a variety of sources makes it difficult to link, cleanse, match, and transfer.
25. Structured Data
• This is data which is in an organized form and can be easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• When data conforms to a pre-defined schema/structure, we say it is structured data.
26. Sources of Structured Data
Structured data:
• Databases such as Oracle, DB2, Teradata, MySQL, etc.
• Spreadsheets
• OLTP systems
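Structured data held under a pre-defined schema, as in the databases listed above, can be illustrated with Python's built-in sqlite3 module (the table, column names, and rows here are invented for the example):

```python
import sqlite3

# Structured data: the schema is declared before any rows are stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 'Sales')")
conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 'IT')")

# Because every row conforms to the schema, queries can rely on it.
rows = conn.execute("SELECT name FROM employee WHERE dept = 'IT'").fetchall()
print(rows)  # [('Ravi',)]
```

Every row has exactly the columns the schema declares, which is what lets a program (or an OLTP system) process the data directly.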
27. Semi-Structured Data
• Semi-structured data is also referred to as self-describing structured data.
I. It does not conform to the data models that one typically associates with relational databases or any other form of data tables.
II. It uses tags to segregate semantic elements.
28. Sources of Semi-Structured Data
Semi-structured data:
• XML
• Other markup languages
• JSON
29. Characteristics of Semi-Structured Data
Semi-structured data:
• Inconsistent structure
• Self-describing
• Schema information is often blended with the data values
• Data objects may have different attributes
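These characteristics can be seen with Python's standard json and xml modules: each record carries its own field names or tags, and two records need not share the same attributes (the sample records below are made up for illustration):

```python
import json
import xml.etree.ElementTree as ET

# JSON: field names travel with the data (self-describing),
# and records may have different attributes.
records = [
    json.loads('{"name": "Asha", "dept": "Sales"}'),
    json.loads('{"name": "Ravi", "phone": "12345"}'),  # different attributes
]
print(sorted(records[1].keys()))  # ['name', 'phone']

# XML: tags segregate the semantic elements.
root = ET.fromstring("<employee><name>Asha</name><dept>Sales</dept></employee>")
print(root.find("name").text)  # Asha
```

No schema had to exist in advance; the structure is recovered from the tags and keys inside the data itself.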
44. • Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.
• Cassandra is being used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
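The “column family” data model can be pictured as a minimal in-memory sketch in plain Python (an illustration of the model only, not the Cassandra API; the row keys and columns are invented): each row key maps to a set of named columns, and different rows may hold different columns.

```python
# A minimal picture of a column-family store:
# column_family[row_key] -> {column_name: value}
users = {
    "user1": {"name": "Asha", "email": "asha@example.com"},
    "user2": {"name": "Ravi", "city": "Chennai"},  # different columns per row
}

def get_column(cf, row_key, column):
    # Look up one named column for one row, as a wide-row store would.
    return cf.get(row_key, {}).get(column)

print(get_column(users, "user1", "email"))  # asha@example.com
print(get_column(users, "user2", "email"))  # None
```

This flexibility per row is what makes the model more expressive than Dynamo's simple key-value pairs.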
45. Features of Cassandra
The following are some of the features of Cassandra:
• Elastic scalability − Cassandra is highly scalable; it allows more hardware to be added to accommodate more customers and more data as per requirement.
• Always-on architecture − Cassandra has no single point of failure, and it is continuously available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
46. • Flexible data storage − Cassandra accommodates all possible data formats, including structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
• Easy data distribution − Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.
• Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
• Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing read efficiency.
47. Cassandra Architecture
• Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.
• All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the network.
48. Data Replication in Cassandra
• In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data.
• If it is detected that some of the nodes responded with an out-of-date value, Cassandra will return the most recent value to the client.
• After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.
51. • Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable − A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
• Bloom filter − Quick, probabilistic structures for testing whether an element is a member of a set. A Bloom filter is a special kind of cache, and Bloom filters are accessed after every query.
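The Bloom filter idea above can be sketched in a few lines of Python. This is a minimal illustration with arbitrary sizes and salted SHA-256 hashes, not Cassandra's actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: fast membership test with possible false
    positives but no false negatives (illustrative, not Cassandra's)."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-42")
```

A "definitely absent" answer is what lets a read skip an SSTable entirely, which is why the filter is consulted before touching disk.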
52. Cassandra Query Language
• Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables.
Write Operations
• Every write activity of nodes is captured by the commit logs written on the nodes. The captured data is stored in the mem-table. Whenever the mem-table is full, data is written into the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster.
Read Operations
• During read operations, Cassandra gets values from the mem-table and checks the Bloom filter to find the appropriate SSTable that holds the required data.
53. Cassandra - Data Model
The data model of Cassandra is significantly different from an RDBMS.
Cluster
• A Cassandra database is distributed over several machines that operate together. The outermost container is known as the Cluster.
• For failure handling, every node contains a replica, and in case of a failure, the replica takes charge.
• Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.
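The ring assignment can be sketched as a toy consistent-hashing scheme. The 100-slot token space, the MD5-based token function, and the node names are all illustrative assumptions; Cassandra actually uses a partitioner such as Murmur3 over a much larger token range:

```python
import hashlib

def token(value):
    # Illustrative token function (not Cassandra's Murmur3 partitioner):
    # hash the value into a small 0-99 token space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 100

def ring_positions(nodes):
    # Place each node on the ring at its own token, sorted by position.
    return sorted((token(n), n) for n in nodes)

def coordinator_for(key, ring):
    # A key belongs to the first node at or after its token,
    # wrapping around to the start of the ring if necessary.
    t = token(key)
    for pos, node in ring:
        if pos >= t:
            return node
    return ring[0][1]

ring = ring_positions(["node-a", "node-b", "node-c"])
owner = coordinator_for("user:1001", ring)
```

Because keys hash uniformly around the ring, adding a node only moves the keys between it and its predecessor, which is what makes elastic scaling cheap.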
54. Keyspace
A keyspace is the outermost container for data in Cassandra. The basic attributes of a keyspace in Cassandra are:
1. Replication factor
2. Replica placement strategy
3. Column families
• Replication factor − The number of machines in the cluster that will receive copies of the same data.
55. • Replica placement strategy − The strategy used to place replicas in the ring.
Strategies:
1. Simple strategy (rack-aware strategy)
2. Old network topology strategy (rack-aware strategy)
3. Network topology strategy (datacenter-shared strategy)
• Column families − A keyspace is a container for a list of one or more column families. A column family, in turn, is a container for a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
56. Syntax
The syntax for creating a keyspace is as follows:
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
[Figure: schematic view of a keyspace]
57. Column Family
• A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
• A Cassandra column family has the following attributes:
keys_cached − The number of locations to keep cached per SSTable.
rows_cached − The number of rows whose entire contents will be cached in memory.
preload_row_cache − Specifies whether you want to pre-populate the row cache.
59. Column
• A column is the basic data structure of Cassandra, with three values: the key or column name, the value, and a timestamp.
[Figure: structure of a column]
Super Column
• A super column is a special column; therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
• Generally, column families are stored on disk in individual files.
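The column and super-column shapes described above can be modeled directly as Python dictionaries. The field names and the `address` example are hypothetical, chosen only to make the nesting visible:

```python
import time

def make_column(name, value, timestamp=None):
    # A Cassandra column is a triple: name, value, and a timestamp.
    return {"name": name, "value": value,
            "timestamp": timestamp if timestamp is not None else time.time()}

def make_super_column(name, sub_columns):
    # A super column is itself a key-value pair, but its value is a
    # map of sub-columns keyed by their column names.
    return {"name": name,
            "value": {c["name"]: c for c in sub_columns}}

# Hypothetical example: an "address" super column with two sub-columns.
city = make_column("city", "Chennai", timestamp=1)
pin = make_column("pincode", "600001", timestamp=1)
address = make_super_column("address", [city, pin])
```

The timestamp is what lets replicas reconcile concurrent writes: on conflict, the column with the newest timestamp wins.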
62. Introduction
• When Big Data storage and analysis tools such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them.
• Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational database servers and Hadoop's HDFS.
63. SQOOP - DEFINITION
Sqoop: “SQL to Hadoop and Hadoop to SQL”.
A tool to transfer data between Hadoop and relational databases such as Teradata, MySQL, PostgreSQL, Oracle, and Netezza.
It is provided by the Apache Software Foundation.
66. SQOOP IMPORT
The import tool imports individual tables from an RDBMS to HDFS.
Each row in a table is treated as a record in HDFS.
All records are stored as text data in text files or as binary data in Avro and Sequence files.
67. SQOOP EXPORT
The export tool exports a set of files from HDFS back to an RDBMS.
The files given as input to Sqoop contain records, which are called rows in the table.
Those files are read and parsed into a set of records delimited with a user-specified delimiter.
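Sqoop itself is a command-line tool, but the parsing step described above (HDFS text files split into rows on a user-specified delimiter) can be sketched in Python. The input lines and field layout here are hypothetical:

```python
def parse_records(lines, delimiter=","):
    # Each line of an HDFS export file becomes one row: a tuple of
    # fields obtained by splitting on the user-specified delimiter.
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]

# Hypothetical export file contents: id, product name, price.
hdfs_lines = ["1,widget,9.99\n", "2,gadget,4.50\n"]
rows = parse_records(hdfs_lines)
```

In real Sqoop jobs, the delimiter is chosen with flags at import time and must be quoted consistently at export time so that fields containing the delimiter are not split incorrectly.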
68. FEATURES OF SQOOP
o Full Load.
o Incremental Load.
o Parallel import/export.
o Import results of SQL query.
o Compression.
o Connectors for all major RDBMS Databases.
o Kerberos Security Integration.
69. ADVANTAGES OF SQOOP
Allows the transfer of data with a variety of structured data
stores like Postgres, Oracle, Teradata, and so on.
Sqoop can execute the data transfer in parallel, so
execution can be quick and more cost effective.
Helps to integrate with sequential data from the
mainframe.
70. DISADVANTAGES OF SQOOP
It uses a JDBC connection to connect with RDBMS-based data stores, which can be inefficient and less performant.
For performing analysis, it executes various MapReduce jobs, which can be time consuming when there are a lot of joins or the data is in a denormalized form.
71. HIVE - Introduction
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
• Hive is not:
- A relational database
- A design for Online Transaction Processing (OLTP)
- A language for real-time queries and row-level updates
72. Features of Hive
• It stores schema in a database and processed data in HDFS.
• It is designed for OLAP. It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
74. Working of Hive
• The following diagram depicts the workflow between Hive and
Hadoop.
75. Social Network
• A social network is a structure between actors, mostly individuals or organizations.
• It indicates the ways in which they are connected through various social familiarities, ranging from casual acquaintance to close familial bonds.
76. Society as a Graph
• People are represented as nodes.
• Relationships are represented as edges: relationships may be acquaintanceship, friendship, co-authorship, etc.
• This allows analysis using the tools of mathematical graph theory.
77. Social Network Analysis
Social network analysis (SNA) is the mapping and measuring of relationships and flows between people, groups, organizations, computers, or other information/knowledge processing entities.
78. Connections
Size
The number of nodes.
Density
The number of ties that are present divided by the number of ties that could be present.
Out-degree
The sum of connections from an actor to others.
In-degree
The sum of connections to an actor.
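These connection measures fall out of a plain adjacency-list representation. The three actors below are hypothetical, used only to show the arithmetic:

```python
# Directed social graph as adjacency lists (hypothetical actors).
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["carol"],
    "carol": [],
}

def size(g):
    # Size: the number of nodes.
    return len(g)

def density(g):
    # Density: ties present divided by ties possible,
    # which for a directed graph is n * (n - 1).
    n = len(g)
    ties = sum(len(targets) for targets in g.values())
    return ties / (n * (n - 1))

def out_degree(g, actor):
    # Out-degree: connections from the actor to others.
    return len(g[actor])

def in_degree(g, actor):
    # In-degree: connections from others to the actor.
    return sum(actor in targets for targets in g.values())
```

With three actors and three ties, the density is 3 / 6 = 0.5; alice has out-degree 2, and carol has in-degree 2.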
79. Distance
Walk
A sequence of actors and relations that begins and ends with actors.
Geodesic distance
The number of relations in the shortest possible walk from one actor to another.
Maximum flow
The number of different actors in the neighbourhood of a source that lead to pathways to a target.
80. Some measures of power and prestige
Degree
The sum of connections from or to an actor.
Closeness centrality
The distance of one actor to all others in the network.
Betweenness centrality
A number that represents how frequently an actor lies on the geodesic paths between other actors.
81. Social network analysis: what for?
To control information flow
To improve/stimulate communication
To improve network resilience
To build trust
88. Community identification and marketing:
1. Seasonal workers
2. SMEs
3. Students
4. School children
Customer lifestyle analysis:
Analysis based on identifying critical life-stage events using social network changes:
1. Going to university
2. Moving
3. Changing jobs
4. Starting a relationship / moving in as a couple
5. Imputing demographics
89. BIG DATA & IoT
• Big data is more about collecting and accumulating huge amounts of data for analysis
afterward, whereas IoT is about simultaneously collecting and
processing data to make real-time decisions.
• The Internet of Things, or IoT, is a system of interrelated computing devices,
mechanical and digital machines, objects, animals or people that are
provided with unique identifiers (UIDs) and the ability to transfer data over a
network without requiring human-to-human or human-to-computer
interaction.
90. How Big Data Powers the Internet of Things
The Internet of Things (IoT) may sound like a futuristic term, but it’s
already here and increasingly woven into our everyday lives. The concept is
simpler than you may think: If you have a smart TV, fridge, doorbell, or any
other connected device, that’s part of the IoT.
Example 1: The region’s most popular theme park has released its own app.
It does more than just provide a map, schedule, and menu items (though
those are important); it also uses GPS pings to identify app users in line, thus
being able to display predicted wait times for rides based on density, even
being able to reserve a spot or trigger attractions based on proximity.
91. The Connection Between Big Data and IoT
• A company’s devices are installed to use sensors for collecting and transmitting data.
• That big data, sometimes petabytes of it, is then collected, often in a repository
called a data lake. Both structured data from prepared data sources (user profiles,
transactional information, etc.) and unstructured data from other sources (social media
archives, emails and call center notes, security camera images, licensed data, etc.) reside in
the data lake.
• Reports, charts, and other outputs are generated, sometimes by AI-driven analytics
platforms such as Oracle Analytics.
• User devices provide further metrics through settings, preferences, scheduling, metadata,
and other tangible transmissions, feeding back into the data lake for even heavier volumes
of big data.
92. What is the Internet of Things
• According to the Global Standards Initiative on the Internet of Things
(IoT-GSI), The Internet of Things is defined as the ‘infrastructure of
the information society’. Well, simply put, it is the interconnection
and the internetworking of devices, vehicles and various other
embedded components which are collectively used to gather data and
also analyze them in real time.
93. How Does IoT help
• IoT can help you manage your home in a more effective way. It helps you to
keep a check on your home from a remote location.
• IoT can help in better environment monitoring by analyzing the air and the
water quality.
• IoT can help media companies to understand the behaviour of their audience
better and develop more effective content targeted towards a specific niche.
96. IoT Enablers
• RFIDs: use radio waves to electronically track the tags attached to each physical object.
• Sensors: devices that are able to detect changes in an environment (e.g., motion detectors).
• Nanotechnology: as the name suggests, extremely small devices with dimensions usually less than a hundred nanometers.
• Smart networks (e.g., mesh topology).
97. Applications and Domains
IoT is currently found in four popular domains:
• 1) Manufacturing/Industrial business - 40.2%
• 2) Healthcare - 30.3%
• 3) Security - 7.7%
• 4) Retail - 8.3%
98. Modern Applications for IoT
• Smart Grids
• Smart cities
• Smart homes
• Healthcare
• Earthquake detection
• Radiation detection/hazardous gas detection
• Smartphone detection
• Water flow monitoring
100. Big data platforms and IoT
• Context-Aware Infrastructures for the Internet of Things
• A Study on Opportunistic Data Dissemination Support for the Internet of
Things
• Future Trends and Research Directions in Big Data Platforms for the Internet
of Things
101. How does IoT contribute to big data?
• IoT connects things to the internet using sensors; the data collected is used for analysis, monitoring, and storage.
• Cloud computing helps store and access that data without a large investment in systems and software.
• So the combination of both technologies can save both time and money.
102. IoT and big data working together
• There are many examples of big data and IoT working well together to offer
analysis and insight. One such example comes from shipping organizations.
They have been utilizing big data analytics and sensor data to improve efficiency,
save money and lower their environmental impact. They utilize sensors on their
delivery vehicles to monitor engine health, number of stops, mileage,
miles per gallon, and speed.
• IoT and big data are also creating waves in big agriculture. In this area, connected
systems monitor field moisture levels and transmit this data to farmers over a
wireless connection, enabling farmers to find out when crops are reaching
optimum moisture levels.
104. Big Data Management Technologies
Now let us deal with the technologies
falling under each of these categories, with
their facts and capabilities, along with the
companies that are using them.
105. Data Storage
• The Hadoop framework was
designed to store and process
data in a distributed data-processing
environment, using commodity
hardware and a simple
programming model.
• It can store and analyse
data present on different
machines at high speed and
low cost.
106. Data Mining
• Presto is an open-source
distributed SQL query engine
for running interactive analytic
queries against data sources
of all sizes, ranging from
gigabytes to petabytes.
• Presto allows querying data in
Hive, Cassandra, relational
databases, and proprietary
data stores.
107. Data Analytics
• Apache Kafka is a distributed
streaming platform. A
streaming platform has three
key capabilities:
. Publish and subscribe to streams of records
. Store streams of records durably
. Process streams of records as they occur
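The publish/subscribe capability can be illustrated with a toy in-memory broker. This is a sketch of the model only, not Kafka's actual API; the topic name, consumer-group name, and record shape are hypothetical, and real clients would use a library such as kafka-python:

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker illustrating publish/subscribe with
    per-consumer-group offsets (not Kafka's actual API)."""
    def __init__(self):
        self.topics = defaultdict(list)    # topic -> append-only log
        self.offsets = defaultdict(int)    # (topic, group) -> next offset

    def publish(self, topic, record):
        # Producers append records to the end of a topic's log.
        self.topics[topic].append(record)

    def consume(self, topic, group):
        # Each consumer group tracks its own position in the log,
        # so multiple groups can read the same topic independently.
        offset = self.offsets[(topic, group)]
        records = self.topics[topic][offset:]
        self.offsets[(topic, group)] = len(self.topics[topic])
        return records

broker = Broker()
broker.publish("sensor-readings", {"device": "t-1", "temp": 21.5})
batch = broker.consume("sensor-readings", "analytics")
```

The append-only log plus per-group offsets is the core design choice: records are stored once, and every subscriber group replays them at its own pace.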
108. Data Visualisation
• Tableau is a powerful and
fast-growing data
visualisation tool used in the
business intelligence
industry.
• Data analysis is very fast
with Tableau, and the
visualisations created are in
the form of dashboards and
worksheets.