Big data analytics (BDA) provides capabilities for revealing additional value from big data. It examines large amounts of data from various sources to deliver insights that enable real-time decisions. BDA is different from data warehousing and business intelligence systems. The complexity of big data systems required the development of specialized architectures such as Hadoop, which processes large amounts of data in a timely and low-cost manner. Big data challenges include capturing, storing, analyzing, sharing, transferring, visualizing, querying, and updating large and diverse datasets, and ensuring their privacy.
1. Big Data and Analytics
Name of the Staff: M. Florence Dayana
Head, Dept. of CA
Bon Secours College for Women
Thanjavur.
3. Introduction
• Big data analytics (BDA) is a new approach in information management which provides a set of capabilities for revealing additional value from big data (BD).
• It is defined as “the process of examining large amounts of data, from a variety of data sources and in different formats, to deliver insights that can enable decisions in real or near real time”.
• BDA is a different concept from those of Data Warehouse (DW) or Business Intelligence (BI) systems.
4. Introduction
• The complexity of BD systems required the development of a specialized architecture.
• Nowadays, the most commonly used BD architecture is Hadoop.
• It has redefined data management because it processes large amounts of data in a timely manner and at a low cost.
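The MapReduce model at the heart of Hadoop can be pictured with a small, self-contained sketch (plain Python, not the Hadoop API; the sample documents are made up for illustration): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big tools", "hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real Hadoop cluster the same three steps run in parallel across many machines, with the shuffle moving data over the network; the logic per step is no more complicated than this.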
6. Challenges with Big Data
Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources.
Big data was originally associated with three key concepts: volume, variety, and velocity.
8. Dealing with data growth
• Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years.
Generating insights in a timely manner
• The infrastructure for big data must be cost-efficient, elastic, and easy to upgrade or downgrade.
9. Recruiting and retaining big data talent
• Another challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions.
10. Integrating disparate data sources
• There is a dearth of skilled professionals who possess a high level of proficiency in data science, which is vital in implementing a big data solution.
Validating data
• The data changes are highly dynamic, and therefore there is a need to ingest this data as quickly as possible.
11. Visualization
• Data visualization is becoming popular as a separate discipline, yet the field falls short by quite a number as far as business-visualization experts are concerned.
12. How Big Data Impacts IT
• Introduction
The information technology sector terms the exponential amounts of data generated in today's interconnected world 'Big Data.'
Big data comes in many forms, from meteorological and astronomical calculations and mappings to social media networks and photo-sharing networks.
Retailers, government agencies, healthcare providers and insurers, financial institutions, and other organizations collect large amounts of data on every transaction, every doctor's office visit or purchase, to improve the functions or processes in which they are involved.
13. The Effect of Big Data on Information Technology Employment
New data and document control systems, software, and infrastructure to move, process, and store this information are being developed as we speak, as older systems become obsolete. Indeed, the amount of data we are generating is growing at an exponential rate. Some of the resulting effects include:
• An employment boom for specialists and IT professionals
• A shortage of IT workers in the US with the specific skills to handle large pools of data
• A developed need for employer-sponsored training programs
• Calls for the government to issue visas to foreign workers in the US
14. • More data-reliant companies in the marketplace as technology evolves
• New specialty job positions emerging in the healthcare IT sector
• Special higher-education programs being developed to meet future demand in Healthcare Informatics
• While the visa-issuance debate rages on, IT and Healthcare IT recruitment companies like Talascend are helping customers find the best-fit talent for customers and best-fit IT jobs for candidates in retail, financial, healthcare, software, insurance, manufacturing, and other technology markets to handle these effects.
21. 3 V’s of Big Data
• Volume: The amount of data collected from various sources, including e-business transactions (PayPal, Paytm, Airtel Money, etc.), social media (Facebook, Twitter, WhatsApp), sensors (weather monitoring, space sensors), and machine-to-machine data (networking, IoT), by millions of users around the world. To study such massive data, Hadoop provides a great tool.
• Velocity: The massive stored data needs to be handled at unprecedented speed under time constraints. In addition, devices should be connected in parallel with smart sensors and metering devices that process in real time, to keep the data transparent.
22. 3 V’s of Big Data
• Variety: Data comes in two or more formats, but mainly as structured data (numeric data in traditional databases) and unstructured data (like stock ticker data, email, financial transactions, audio, video, and text documents).
• Variability: Inconsistency of the data set, arriving at high velocity and in a variety of forms, needs to be processed without hampering the information, and the speed must be managed at peak data-processing load; for example, social media data demand increases in the morning and evening.
23. 3 V’s of Big Data
• Complexity: The data coming from a variety of sources makes it difficult to link, cleanse, match, and transfer.
25. Structured Data
• This is data which is in an organized form and can be easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• When data conforms to a pre-defined schema/structure, we say it is structured data.
26. Sources of Structured Data
Structured data:
• Databases such as Oracle, DB2, Teradata, MySQL, etc.
• Spreadsheets
• OLTP systems
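Structured data held under a pre-defined schema, as in the databases listed above, can be illustrated with Python's built-in sqlite3 module (the table, column names, and rows here are invented for the example):

```python
import sqlite3

# Structured data: the schema is declared before any rows are stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 'Sales')")
conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 'IT')")

# Because every row conforms to the schema, queries can rely on it.
rows = conn.execute("SELECT name FROM employee WHERE dept = 'IT'").fetchall()
print(rows)  # [('Ravi',)]
```

Every row has exactly the columns the schema declares, which is what lets a program (or an OLTP system) process the data directly.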
27. Semi-Structured Data
• Semi-structured data is also referred to as self-describing structured data.
I. It does not conform to the data models that one typically associates with relational databases or any other form of data tables.
II. It uses tags to segregate semantic elements.
28. Sources of Semi-Structured Data
Semi-structured data:
• XML
• Other markup languages
• JSON
29. Characteristics of Semi-Structured Data
Semi-structured data:
• Inconsistent structure
• Self-describing
• Schema information is often blended with the data values
• Data objects may have different attributes
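These characteristics can be seen with Python's standard json and xml modules: each record carries its own field names or tags, and two records need not share the same attributes (the sample records below are made up for illustration):

```python
import json
import xml.etree.ElementTree as ET

# JSON: field names travel with the data (self-describing),
# and records may have different attributes.
records = [
    json.loads('{"name": "Asha", "dept": "Sales"}'),
    json.loads('{"name": "Ravi", "phone": "12345"}'),  # different attributes
]
print(sorted(records[1].keys()))  # ['name', 'phone']

# XML: tags segregate the semantic elements.
root = ET.fromstring("<employee><name>Asha</name><dept>Sales</dept></employee>")
print(root.find("name").text)  # Asha
```

No schema had to exist in advance; the structure is recovered from the tags and keys inside the data itself.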
44. • Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.
• Cassandra is being used by some of the biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, Netflix, and more.
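The “column family” data model can be pictured as a minimal in-memory sketch in plain Python (an illustration of the model only, not the Cassandra API; the row keys and columns are invented): each row key maps to a set of named columns, and different rows may hold different columns.

```python
# A minimal picture of a column-family store:
# column_family[row_key] -> {column_name: value}
users = {
    "user1": {"name": "Asha", "email": "asha@example.com"},
    "user2": {"name": "Ravi", "city": "Chennai"},  # different columns per row
}

def get_column(cf, row_key, column):
    # Look up one named column for one row, as a wide-row store would.
    return cf.get(row_key, {}).get(column)

print(get_column(users, "user1", "email"))  # asha@example.com
print(get_column(users, "user2", "email"))  # None
```

This flexibility per row is what makes the model more expressive than Dynamo's simple key-value pairs.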
45. Features of Cassandra
The following are some of the features of Cassandra:
• Elastic scalability − Cassandra is highly scalable; it allows more hardware to be added to accommodate more customers and more data as per requirement.
• Always-on architecture − Cassandra has no single point of failure, and it is continuously available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases your throughput as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
46. • Flexible data storage − Cassandra accommodates all possible data formats, including structured, semi-structured, and unstructured. It can dynamically accommodate changes to your data structures according to your need.
• Easy data distribution − Cassandra provides the flexibility to distribute data where you need it by replicating data across multiple data centers.
• Transaction support − Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
• Fast writes − Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast writes and can store hundreds of terabytes of data, without sacrificing read efficiency.
47. Cassandra Architecture
• Cassandra has a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.
• All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the network.
48. Data Replication in Cassandra
• In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data.
• If it is detected that some of the nodes responded with an out-of-date value, Cassandra will return the most recent value to the client.
• After returning the most recent value, Cassandra performs a read repair in the background to update the stale values.
51. • Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
• SSTable − A disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
• Bloom filter − Quick, probabilistic structures for testing whether an element is a member of a set. A Bloom filter is a special kind of cache, and Bloom filters are accessed after every query.
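The Bloom filter idea above can be sketched in a few lines of Python. This is a minimal illustration with arbitrary sizes and salted SHA-256 hashes, not Cassandra's actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: fast membership test with possible false
    positives but no false negatives (illustrative, not Cassandra's)."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-42")
```

A "definitely absent" answer is what lets a read skip an SSTable entirely, which is why the filter is consulted before touching disk.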
52. Cassandra Query Language
• Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables.
Write Operations
• Every write activity of nodes is captured by the commit logs written on the nodes. The captured data is stored in the mem-table. Whenever the mem-table is full, data is written into the SSTable data file. All writes are automatically partitioned and replicated throughout the cluster.
Read Operations
• During read operations, Cassandra gets values from the mem-table and checks the Bloom filter to find the appropriate SSTable that holds the required data.
53. Cassandra - Data Model
The data model of Cassandra is significantly different from an RDBMS.
Cluster
• A Cassandra database is distributed over several machines that operate together. The outermost container is known as the Cluster.
• For failure handling, every node contains a replica, and in case of a failure, the replica takes charge.
• Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.
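The ring assignment can be sketched as a toy consistent-hashing scheme. The 100-slot token space, the MD5-based token function, and the node names are all illustrative assumptions; Cassandra actually uses a partitioner such as Murmur3 over a much larger token range:

```python
import hashlib

def token(value):
    # Illustrative token function (not Cassandra's Murmur3 partitioner):
    # hash the value into a small 0-99 token space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 100

def ring_positions(nodes):
    # Place each node on the ring at its own token, sorted by position.
    return sorted((token(n), n) for n in nodes)

def coordinator_for(key, ring):
    # A key belongs to the first node at or after its token,
    # wrapping around to the start of the ring if necessary.
    t = token(key)
    for pos, node in ring:
        if pos >= t:
            return node
    return ring[0][1]

ring = ring_positions(["node-a", "node-b", "node-c"])
owner = coordinator_for("user:1001", ring)
```

Because keys hash uniformly around the ring, adding a node only moves the keys between it and its predecessor, which is what makes elastic scaling cheap.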
54. Keyspace
A keyspace is the outermost container for data in Cassandra. The basic attributes of a keyspace in Cassandra are:
1. Replication factor
2. Replica placement strategy
3. Column families
• Replication factor − The number of machines in the cluster that will receive copies of the same data.
55. • Replica placement strategy − The strategy used to place replicas in the ring.
Strategies:
1. Simple strategy (rack-aware strategy)
2. Old network topology strategy (rack-aware strategy)
3. Network topology strategy (datacenter-shared strategy)
• Column families − A keyspace is a container for a list of one or more column families. A column family, in turn, is a container for a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each keyspace has at least one and often many column families.
56. Syntax
The syntax for creating a keyspace is as follows:
CREATE KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
[Figure: schematic view of a keyspace]
57. Column Family
• A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
• A Cassandra column family has the following attributes:
keys_cached − The number of locations to keep cached per SSTable.
rows_cached − The number of rows whose entire contents will be cached in memory.
preload_row_cache − Specifies whether you want to pre-populate the row cache.
59. Column
• A column is the basic data structure of Cassandra, with three values: the key or column name, the value, and a timestamp.
[Figure: structure of a column]
Super Column
• A super column is a special column; therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
• Generally, column families are stored on disk in individual files.
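The column and super-column shapes described above can be modeled directly as Python dictionaries. The field names and the `address` example are hypothetical, chosen only to make the nesting visible:

```python
import time

def make_column(name, value, timestamp=None):
    # A Cassandra column is a triple: name, value, and a timestamp.
    return {"name": name, "value": value,
            "timestamp": timestamp if timestamp is not None else time.time()}

def make_super_column(name, sub_columns):
    # A super column is itself a key-value pair, but its value is a
    # map of sub-columns keyed by their column names.
    return {"name": name,
            "value": {c["name"]: c for c in sub_columns}}

# Hypothetical example: an "address" super column with two sub-columns.
city = make_column("city", "Chennai", timestamp=1)
pin = make_column("pincode", "600001", timestamp=1)
address = make_super_column("address", [city, pin])
```

The timestamp is what lets replicas reconcile concurrent writes: on conflict, the column with the newest timestamp wins.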
62. Introduction
• When Big Data storage and analysis tools such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them.
• Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational database servers and Hadoop's HDFS.
63. SQOOP - DEFINITION
Sqoop: “SQL to Hadoop and Hadoop to SQL”.
A tool to transfer data between Hadoop and relational databases such as Teradata, MySQL, PostgreSQL, Oracle, and Netezza.
It is provided by the Apache Software Foundation.
66. SQOOP IMPORT
The import tool imports individual tables from an RDBMS to HDFS.
Each row in a table is treated as a record in HDFS.
All records are stored as text data in text files or as binary data in Avro and Sequence files.
67. SQOOP EXPORT
The export tool exports a set of files from HDFS back to an RDBMS.
The files given as input to Sqoop contain records, which are called rows in the table.
Those files are read and parsed into a set of records delimited with a user-specified delimiter.
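Sqoop itself is a command-line tool, but the parsing step described above (HDFS text files split into rows on a user-specified delimiter) can be sketched in Python. The input lines and field layout here are hypothetical:

```python
def parse_records(lines, delimiter=","):
    # Each line of an HDFS export file becomes one row: a tuple of
    # fields obtained by splitting on the user-specified delimiter.
    return [tuple(line.rstrip("\n").split(delimiter)) for line in lines]

# Hypothetical export file contents: id, product name, price.
hdfs_lines = ["1,widget,9.99\n", "2,gadget,4.50\n"]
rows = parse_records(hdfs_lines)
```

In real Sqoop jobs, the delimiter is chosen with flags at import time and must be quoted consistently at export time so that fields containing the delimiter are not split incorrectly.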
68. FEATURES OF SQOOP
o Full Load.
o Incremental Load.
o Parallel import/export.
o Import results of SQL query.
o Compression.
o Connectors for all major RDBMS Databases.
o Kerberos Security Integration.
69. ADVANTAGES OF SQOOP
Allows the transfer of data with a variety of structured data
stores like Postgres, Oracle, Teradata, and so on.
Sqoop can execute the data transfer in parallel, so
execution can be quick and more cost effective.
Helps to integrate with sequential data from the
mainframe.
70. DISADVANTAGES OF SQOOP
It uses a JDBC connection to connect with RDBMS-based data stores, which can be inefficient and less performant.
For performing analysis, it executes various MapReduce jobs, which can be time consuming when there are a lot of joins or the data is in a denormalized form.
71. HIVE - Introduction
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
• Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
• Hive is not:
- A relational database
- A design for Online Transaction Processing (OLTP)
- A language for real-time queries and row-level updates
72. Features of Hive
• It stores schema in a database and processed data in HDFS.
• It is designed for OLAP. It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
74. Working of Hive
• The following diagram depicts the workflow between Hive and
Hadoop.
75. Social Network
• A social network is a structure between actors, mostly individuals or organizations.
• It indicates the ways in which they are connected through various social familiarities, ranging from casual acquaintance to close familial bonds.
76. Society as a Graph
• People are represented as nodes.
• Relationships are represented as edges: relationships may be acquaintanceship, friendship, co-authorship, etc.
• This allows analysis using the tools of mathematical graph theory.
77. Social Network Analysis
Social network analysis (SNA) is the mapping and measuring of relationships and flows between people, groups, organizations, computers, or other information/knowledge processing entities.
78. Connections
Size
The number of nodes.
Density
The number of ties that are present divided by the number of ties that could be present.
Out-degree
The sum of connections from an actor to others.
In-degree
The sum of connections to an actor.
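These connection measures fall out of a plain adjacency-list representation. The three actors below are hypothetical, used only to show the arithmetic:

```python
# Directed social graph as adjacency lists (hypothetical actors).
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["carol"],
    "carol": [],
}

def size(g):
    # Size: the number of nodes.
    return len(g)

def density(g):
    # Density: ties present divided by ties possible,
    # which for a directed graph is n * (n - 1).
    n = len(g)
    ties = sum(len(targets) for targets in g.values())
    return ties / (n * (n - 1))

def out_degree(g, actor):
    # Out-degree: connections from the actor to others.
    return len(g[actor])

def in_degree(g, actor):
    # In-degree: connections from others to the actor.
    return sum(actor in targets for targets in g.values())
```

With three actors and three ties, the density is 3 / 6 = 0.5; alice has out-degree 2, and carol has in-degree 2.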
79. Distance
Walk
A sequence of actors and relations that begins and ends with actors.
Geodesic distance
The number of relations in the shortest possible walk from one actor to another.
Maximum flow
The number of different actors in the neighbourhood of a source that lead to pathways to a target.
80. Some measures of power and prestige
Degree
The sum of connections from or to an actor.
Closeness centrality
The distance of one actor to all others in the network.
Betweenness centrality
A number that represents how frequently an actor lies on the geodesic paths between other actors.
81. Social network analysis: what for?
To control information flow
To improve/stimulate communication
To improve network resilience
To build trust
88. Community identification and marketing:
1. Seasonal workers
2. SMEs
3. Students
4. School children
Customer lifestyle analysis:
Analysis based on identifying critical life-stage events using social network changes:
1. Going to university
2. Moving
3. Changing jobs
4. Starting a relationship / moving in as a couple
5. Imputing demographics
89. BIG DATA & IoT
• Big data is more about collecting and accumulating huge amounts of data for analysis
afterward, whereas IoT is about simultaneously collecting and
processing data to make real-time decisions.
• The Internet of Things, or IoT, is a system of interrelated computing devices,
mechanical and digital machines, objects, animals or people that are
provided with unique identifiers (UIDs) and the ability to transfer data over a
network without requiring human-to-human or human-to-computer
interaction.
90. How Big Data Powers the Internet of Things
The Internet of Things (IoT) may sound like a futuristic term, but it’s
already here and increasingly woven into our everyday lives. The concept is
simpler than you may think: If you have a smart TV, fridge, doorbell, or any
other connected device, that’s part of the IoT.
Example 1: The region’s most popular theme park has released its own app.
It does more than just provide a map, schedule, and menu items (though
those are important); it also uses GPS pings to identify app users in line, thus
being able to display predicted wait times for rides based on density, even
being able to reserve a spot or trigger attractions based on proximity.
91. The Connection Between Big Data and IoT
• A company’s devices are installed to use sensors for collecting and transmitting data.
• That big data, sometimes petabytes of it, is then collected, often in a repository
called a data lake. Both structured data from prepared data sources (user profiles,
transactional information, etc.) and unstructured data from other sources (social media
archives, emails and call center notes, security camera images, licensed data, etc.) reside in
the data lake.
• Reports, charts, and other outputs are generated, sometimes by AI-driven analytics
platforms such as Oracle Analytics.
• User devices provide further metrics through settings, preferences, scheduling, metadata,
and other tangible transmissions, feeding back into the data lake for even heavier volumes
of big data.
92. What is the Internet of Things
• According to the Global Standards Initiative on the Internet of Things
(IoT-GSI), The Internet of Things is defined as the ‘infrastructure of
the information society’. Well, simply put, it is the interconnection
and the internetworking of devices, vehicles and various other
embedded components which are collectively used to gather data and
also analyze them in real time.
93. How Does IoT help
• IoT can help you manage your home in a more effective way. It helps you to
keep a check on your home from a remote location.
• IoT can help in better environment monitoring by analyzing the air and the
water quality.
• IoT can help media companies to understand the behaviour of their audience
better and develop more effective content targeted towards a specific niche.
96. IoT Enablers
• RFIDs: use radio waves to electronically track the tags attached to each physical object.
• Sensors: devices that are able to detect changes in an environment (e.g., motion detectors).
• Nanotechnology: as the name suggests, extremely small devices with dimensions usually less than a hundred nanometers.
• Smart networks (e.g., mesh topology).
97. Applications and Domains
IoT is currently found in four popular domains:
• 1) Manufacturing/Industrial business - 40.2%
• 2) Healthcare - 30.3%
• 3) Security - 7.7%
• 4) Retail - 8.3%
98. Modern Applications for IoT
• Smart Grids
• Smart cities
• Smart homes
• Healthcare
• Earthquake detection
• Radiation detection/hazardous gas detection
• Smartphone detection
• Water flow monitoring
100. Big data platforms and IoT
• Context-Aware Infrastructures for the Internet of Things
• A Study on Opportunistic Data Dissemination Support for the Internet of
Things
• Future Trends and Research Directions in Big Data Platforms for the Internet
of Things
101. How does IoT contribute to big data?
• IoT connects things to the internet using sensors; the data collected is used for analysis, monitoring, and storage.
• Cloud computing helps store and access that data without a large investment in systems and software.
• So the combination of both technologies can save both time and money.
102. IoT and big data working together
• There are many examples of big data and IoT working well together to offer
analysis and insight. One such example comes from shipping organizations.
They have been utilizing big data analytics and sensor data to improve efficiency,
save money and lower their environmental impact. They utilize sensors on their
delivery vehicles to monitor engine health, number of stops, mileage,
miles per gallon, and speed.
• IoT and big data are also creating waves in big agriculture. In this area, connected
systems monitor field moisture levels and transmit this data to farmers over a
wireless connection, enabling farmers to find out when crops are reaching
optimum moisture levels.
104. Big Data Management Technologies
Now let us deal with the technologies
falling under each of these categories, with
their facts and capabilities, along with the
companies that are using them.
105. Data Storage
• The Hadoop framework was
designed to store and process
data in a distributed data-processing
environment, using commodity
hardware and a simple
programming model.
• It can store and analyse
data present on different
machines at high speed and
low cost.
106. Data Mining
• Presto is an open-source
distributed SQL query engine
for running interactive analytic
queries against data sources
of all sizes, ranging from
gigabytes to petabytes.
• Presto allows querying data in
Hive, Cassandra, relational
databases, and proprietary
data stores.
107. Data Analytics
• Apache Kafka is a distributed
streaming platform. A
streaming platform has three
key capabilities:
. Publish and subscribe to streams of records
. Store streams of records durably
. Process streams of records as they occur
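The publish/subscribe capability can be illustrated with a toy in-memory broker. This is a sketch of the model only, not Kafka's actual API; the topic name, consumer-group name, and record shape are hypothetical, and real clients would use a library such as kafka-python:

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker illustrating publish/subscribe with
    per-consumer-group offsets (not Kafka's actual API)."""
    def __init__(self):
        self.topics = defaultdict(list)    # topic -> append-only log
        self.offsets = defaultdict(int)    # (topic, group) -> next offset

    def publish(self, topic, record):
        # Producers append records to the end of a topic's log.
        self.topics[topic].append(record)

    def consume(self, topic, group):
        # Each consumer group tracks its own position in the log,
        # so multiple groups can read the same topic independently.
        offset = self.offsets[(topic, group)]
        records = self.topics[topic][offset:]
        self.offsets[(topic, group)] = len(self.topics[topic])
        return records

broker = Broker()
broker.publish("sensor-readings", {"device": "t-1", "temp": 21.5})
batch = broker.consume("sensor-readings", "analytics")
```

The append-only log plus per-group offsets is the core design choice: records are stored once, and every subscriber group replays them at its own pace.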
108. Data Visualisation
• Tableau is a powerful and
fast-growing data
visualisation tool used in the
business intelligence
industry.
• Data analysis is very fast
with Tableau, and the
visualisations created are in
the form of dashboards and
worksheets.