The document discusses big data analytics and related topics. It provides definitions of big data, describes the increasing volume, velocity and variety of data. It also discusses challenges in data representation, storage, analytical mechanisms and other aspects of working with large datasets. Approaches for extracting value from big data are examined, along with applications in various domains.
Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types (structured and unstructured, streaming and batch) and different sizes, from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. It has one or more of the following characteristics: high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media, much of it generated in real time and at very large scale.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources, independently or together with their existing enterprise data, to gain new insights resulting in significantly better and faster decisions.
Disclaimer:
The images, company, product, and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners.
Data and images were collected from various sources on the Internet.
The intention is to present the big picture of Big Data & Hadoop.
Big data nowadays is a challenge to be managed, not a barrier to business growth. Data storage costs are relatively inexpensive, and with more transactions generated by social media, machines, and sensors, data has grown piece by piece into petabytes.
These slides explain the challenges of Big Data (Volume, Velocity, and Variety) and present a solution for managing them.
There are many tools that can help solve these problems, but the main focus here is Apache Hadoop.
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action.
We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations, save the result, and show it via a BI tool.
A high-level overview of common Cassandra use cases, adoption reasons, Big Data trends, DataStax Enterprise, and the future of Big Data, given at the 7th Advanced Computing Conference in Seoul, South Korea.
3. Explosive increase of global data
In 2011, the overall volume of data created and copied in the world was 1.8 ZB.
The term big data is mainly used to describe enormous datasets.
Many government agencies have announced major plans to accelerate big data research and applications.
5. Big data is an abstract concept, different from "massive data" or "very big data."
In 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed, and processed by general computers within an acceptable scope."
6. In May 2011, McKinsey & Company defined big data as datasets which could not be acquired, stored, and managed by classic database software.
7. Doug Laney, an analyst of META (presently Gartner), defined the challenges and opportunities brought about by increased data with a 3Vs model:
the increase of Volume, Velocity, and Variety.
8. The increase of Volume, Velocity, and Variety, plus Big Data Value.
9. Retailers that fully utilize big data may improve their profit by more than 60%.
Developed economies in Europe could save over EUR 100 billion (excluding the effect of reduced fraud, errors, and tax differences).
The goal: extract wisdom from data.
10. Data representation
Data representation aims to make data more meaningful for computer analysis and user interpretation.
It must cope with levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility.
11. Redundancy reduction and data compression
There is a high level of redundancy in datasets, so redundancy reduction and compression are effective for reducing the indirect cost of the entire system.
12. Data life cycle management
The value hidden in big data depends on data freshness.
We must decide which data shall be stored and which data shall be discarded.
13. Analytical mechanism
Traditional RDBMSs are strictly designed and lack scalability and expandability.
We shall find a compromise between RDBMSs and non-relational databases.
15. Energy management
From both economic and environmental perspectives, as data volume and analytical demands increase, the processing, storage, and transmission of big data consume more electric energy, so we need:
system-level power consumption control;
a management mechanism.
16. Expandability and scalability
The analytical system of big data must support present and future datasets.
The analytical algorithm must be able to process increasingly expanding and more complex datasets.
17. Cooperation
Analysis of big data is interdisciplinary research, which requires experts in different fields to cooperate.
19. The main objective of cloud computing is to use huge computing and storage resources under concentrated management.
The development of cloud computing provides solutions for the storage and processing of big data.
Distributed storage technology based on cloud computing can effectively manage big data.
The parallel computing capacity of cloud computing can improve the efficiency of acquiring and analyzing big data.
20. Cloud computing transforms the IT architecture, while big data influences business decision making.
Big data and cloud computing have different target customers: cloud computing is an advanced IT solution, while big data focuses on business operations.
22. Enormous numbers of networking sensors are embedded into various devices and machines in the real world.
They collect various kinds of data, such as environmental, geographical, astronomical, and logistic data.
Mobile equipment, transportation facilities, public facilities, and home appliances can all be data acquisition equipment in the IoT.
24. Not only a platform for concentrated storage of data, but also for:
acquiring data;
managing data;
organizing data;
leveraging data values and functions.
25. Hadoop is widely used in big data applications in industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation.
Much academic research is now based on Hadoop.
As of June 2012, Yahoo ran Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering.
26. Facebook announced that its Hadoop cluster could process 100 PB of data, growing by 0.5 PB per day as of November 2012.
Many companies provide commercial Hadoop execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.
28. Enterprise data
Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day.
Walmart processes one million customer transactions per hour, over 2.5 PB.
Akamai analyzes 75 million events per day.
29. IoT data comes from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.
The data generated from the IoT has the following features:
large-scale data;
heterogeneity;
strong time and space correlation;
effective data accounts for only a small portion of the big data.
30. High-throughput bio-measurement technologies have been developed, along with the completion of the HGP (Human Genome Project).
One sequencing of a human genome may generate 100-600 GB of raw data.
The China National GeneBank held 1.15 million human samples and 150,000 animal, plant, and microorganism samples, with 30 million expected by 2015.
31. Clinical medical care and medical R&D data
Pittsburgh Medical Center has stored 2 TB of such data.
By 2015, the average data volume of every hospital is expected to increase from 167 TB to 665 TB.
32. Astronomy data: 25 TB recorded from 1998 to 2008.
At the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) at the European Organization for Nuclear Research generated raw data at 2 PB/s and stores about 10 TB of processed data per year.
Mobile data, etc.
34. Data acquisition methods:
log files;
sensing;
methods for acquiring network data (crawlers, etc.);
Libpcap-based packet capture (an easy-to-use library);
zero-copy packet capture;
monitoring software: Wireshark, SmartSniff, WinNetCap;
mobile equipment.
35. Inter-DCN transmissions: from the data source to the data center.
Intra-DCN transmissions: data communication flows within data centers.
Both require high-speed infrastructure.
36. The collected datasets vary with respect to noise, redundancy, and consistency, and it is undoubtedly a waste to store meaningless data. Preprocessing includes:
integration: combining data from different sources to provide users with a uniform view of the data;
cleaning: identifying inaccurate, incomplete, or unreasonable data;
redundancy elimination.
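The cleaning and redundancy-elimination steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline; the record fields ("id", "age") and the validity rules are hypothetical.

```python
# Hypothetical preprocessing sketch: normalize values, drop incomplete or
# unreasonable records, and eliminate duplicate records by id.
def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        # cleaning: strip stray whitespace from string fields
        rec = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        # drop incomplete (missing id) or unreasonable (negative age) rows
        if not rec.get("id") or rec.get("age", 0) < 0:
            continue
        # redundancy elimination: keep only the first record per id
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        cleaned.append(rec)
    return cleaned

rows = [{"id": "u1 ", "age": 34}, {"id": "u1", "age": 34},
        {"id": "", "age": 5}, {"id": "u2", "age": -3}]
result = clean(rows)   # only the first "u1" record survives
```

Real systems apply the same three operations at much larger scale, typically inside the ingestion layer of the cluster.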
38. Direct Attached Storage (DAS): various hard disks are directly connected to servers.
Network storage:
Network Attached Storage (NAS);
Storage Area Network (SAN).
40. Consistency: a distributed storage system requires multiple servers to cooperatively store data.
Availability: it is desirable that the entire system is not seriously affected in satisfying customers' read and write requests.
Partition tolerance: the distributed storage still works well when the network is partitioned.
41. Eric Brewer proposed the CAP theorem in 2000, which states that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance; at most two of the three can be satisfied simultaneously.
Seth Gilbert and Nancy Lynch of MIT proved the correctness of the CAP theorem in 2002.
43. Classified into three bottom-up levels:
1. file systems;
2. databases;
3. programming models.
44. Level 1, file systems: Google's GFS, HDFS, Microsoft Cosmos, Facebook Haystack, Taobao TFS, and FastDFS.
45. Level 2, databases (NoSQL):
1. key-value databases: Dynamo, Voldemort;
2. column-oriented databases: BigTable, Cassandra;
3. document databases (more complex than key-value stores): MongoDB, SimpleDB, and CouchDB.
46. Level 3, programming models:
1. MapReduce: automatic parallel processing; related languages include Sawzall of Google, Pig Latin of Yahoo, Hive of Facebook, and Scope of Microsoft.
2. Dryad: a general-purpose distributed execution engine for processing parallel applications. Also All-Pairs.
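The MapReduce model mentioned above boils down to three phases: map emits (key, value) pairs, shuffle groups values by key, and reduce aggregates each group. A single-process word-count sketch in Python (the canonical MapReduce example, here without any distribution):

```python
from collections import defaultdict

# map phase: emit a (word, 1) pair for every word in a document
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# shuffle phase: group all emitted values by key
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# reduce phase: aggregate each group (here: sum the counts)
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big value", "big clusters"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))   # {'big': 3, 'data': 1, ...}
```

A real framework such as Hadoop runs map tasks and reduce tasks on different machines and performs the shuffle over the network; the programmer supplies only the map and reduce functions.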
49. Use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter.
50. Cluster analysis: a statistical method for grouping objects, specifically, classifying objects according to some features.
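As a toy illustration of cluster analysis, here is one-dimensional k-means in plain Python: points are assigned to the nearest center, then each center moves to the mean of its cluster. The data and starting centers are invented for the sketch.

```python
# Minimal 1-D k-means sketch (illustrative only, not a library implementation).
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
# the two centers converge near 1.0 and 10.0, matching the two groups
```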
51. Factor analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information of the original data.
52. Correlation analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecasting and control.
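The most common numeric measure behind correlation analysis is the Pearson correlation coefficient. A minimal sketch, with made-up sample series:

```python
import math

# Pearson correlation: covariance of the two series divided by the product
# of their standard deviations; ranges from -1 to +1.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly linear, so r = 1.0
```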
53. Regression analysis: a mathematical tool for revealing correlations between one variable and several other variables.
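For the one-predictor case, regression reduces to fitting a line by ordinary least squares. A minimal sketch with invented sample points:

```python
# Simple linear regression (ordinary least squares) for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # data lies on y = 2x + 1
```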
54. A/B testing: also called bucket testing, a technique for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.
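The comparison between the two groups in an A/B test is commonly done with a two-proportion z-statistic on conversion rates. A minimal sketch; the counts are made up:

```python
import math

# Two-proportion z-statistic: difference in conversion rates over the
# standard error under the pooled rate; large |z| suggests a real difference.
def z_statistic(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# variant B converts 26% vs. A's 20% on 1000 users each
z = z_statistic(conv_a=200, n_a=1000, conv_b=260, n_b=1000)
```

Here |z| is around 3, well past the usual 1.96 threshold for 5% significance, so the variant's lift would be considered significant under this sketch's assumptions.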
55. Statistical analysis: based on statistical theory, a branch of applied mathematics.
56. Data mining algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data.
57. Bloom filter: a Bloom filter consists of a bit array and a series of hash functions; it answers set-membership queries compactly, allowing occasional false positives but never false negatives.
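A minimal Bloom filter sketch: each of k hash functions maps an item to a bit position; adding sets those bits, and a lookup reports "possibly present" only if all k bits are set. The sizes below are arbitrary illustrative choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # derive k hash positions by salting one hash function with an index
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("spam@example.com")
```

This compactness is why Bloom filters appear in the spam-filtering and web-crawling applications mentioned earlier: a crawler can cheaply ask "have I seen this URL?" without storing every URL.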
58. Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values.
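For example, using SHA-256 from Python's standard library, any input, however long, maps to a fixed-length digest:

```python
import hashlib

# Hashing transforms arbitrary-length data into a fixed-length value.
short_digest = hashlib.sha256(b"big data").hexdigest()
long_digest = hashlib.sha256(b"big data " * 10000).hexdigest()
# both digests are exactly 64 hex characters (256 bits), and the same
# input always produces the same digest
```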
59. Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data.
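For unstructured text, the classic index structure is an inverted index mapping each term to the documents containing it, so a query touches only matching documents instead of scanning everything. A minimal sketch with made-up documents:

```python
from collections import defaultdict

# Inverted index: word -> set of document ids containing that word.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

index = build_index({1: "big data analytics", 2: "big clusters"})
# querying "big" now returns {1, 2} without scanning the documents
```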
60. Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.
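A minimal trie sketch for the word-frequency use just described: each node holds its children keyed by character and a count of the words ending there, so words sharing a prefix share a path.

```python
# Trie node: children keyed by character, plus a count of words ending here.
class Trie:
    def __init__(self):
        self.children = {}
        self.count = 0

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1   # one more occurrence of this word

    def frequency(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.count

t = Trie()
for w in ["data", "data", "database"]:
    t.insert(w)
# "data" and "database" share the path d-a-t-a; only the suffix differs
```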
61. Parallel computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad.
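The common pattern behind all of these models is splitting a task into independent pieces and running them on several workers at once. A minimal single-machine sketch using threads (MPI or MapReduce would distribute the same pattern across machines):

```python
from concurrent.futures import ThreadPoolExecutor

# each worker computes an independent partial result
def partial_sum(chunk):
    return sum(chunk)

numbers = list(range(1000))
# split the task into 4 independent chunks
chunks = [numbers[i:i + 250] for i in range(0, 1000, 250)]
# run the chunks on several workers at once, then combine the results
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))
```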
62. 1. Real-time vs. offline analysis:
Real-time analysis is mainly used in e-commerce and finance.
Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms.
63. 2. Analysis at different levels:
Memory-level analysis: for cases where the total data volume is smaller than the maximum memory of a cluster.
BI analysis: for cases where the data scale surpasses the memory level but may be imported into the BI analysis environment.
Massive analysis: for cases where the data scale has completely surpassed the capacities of BI products and traditional relational databases.
65. R (30.7%): an open-source programming language and software environment designed for data mining/analysis and visualization.
Excel (29.8%): a core component of Microsoft Office, providing powerful data processing and statistical analysis capabilities.
66. Rapid-I RapidMiner (26.7%): open-source software used for data mining, machine learning, and predictive analysis.
KNIME (21.8%): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform for rich data integration, data processing, data analysis, and data mining.
67. Weka/Pentaho (14.8%): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java.
69. Application areas:
application of big data in enterprises;
application of IoT-based big data;
application of online social network-oriented big data:
structure-based applications;
early warning;
real-time monitoring;
real-time feedback.
73. Fundamental problems of big data: there is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, and a formal description of big data.
Standardization of big data: an evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed.
74. Evolution of big data computing modes: this includes memory mode, data flow mode, PRAM mode, and MR mode, etc.
76. Format conversion of big data
Big data transfer
Real-time performance of big data
Processing of big data
77. Big data management
Searching, mining, and analysis of big data
Integration and provenance of big data
Big data application
78. Big data privacy:
protection of personal privacy during data acquisition;
personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users.
Data quality: data quality influences big data utilization; low-quality data wastes transmission and storage resources and has poor usability.
Big data safety mechanisms: big data brings challenges to data encryption due to its large scale and high diversity.
Dryad models a job as a directed acyclic graph (DAG).
Dryad executes operations on the vertices in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFOs.
During operation, resources in a logical operation graph are automatically mapped to physical resources.