Challenges of Big Data Research
Manfred M. Fischer
Vienna University of Economics and Business
Regional Science Academy, Special Academic Session, ERSA conference,
Tue 23 August 2016
Big Data may be characterized in terms of three dimensions of data management challenges:
Ø Volume refers to the size of the data.
Ø Velocity addresses the speed at which data can be received as well as analyzed.
This also refers to the rate of change of data, which is especially relevant in the
area of stream processing.
Ø Variety refers to the issue of disparate and incompatible data formats. Data can
come in from many different sources and take on many different forms, including
text, audio, video, graph, and more.
1
The Big Data Problem
2
The Big Data Analysis Pipeline
[Figure: the Big Data analysis pipeline within the overall system]
Major Steps in Analysis of Big Data: Data Acquisition and Recording; Information Extraction and Cleaning; Data Integration, Aggregation, and Representation; Query Processing, Data Modelling, and Analysis; Interpretation
Cross-Cutting Challenges: Heterogeneity and Incompleteness; Scale; Timeliness; Privacy; Human Collaboration
Major steps in the analysis of Big Data are shown in the flow at the top. Below it are the Big Data needs
that make these tasks challenging.
Ø Big Data does not arise out of a vacuum, but is recorded from some data
generating source. Much of this data is of no interest, and can be filtered and
compressed by orders of magnitude.
Ø One challenge is to define these filters in such a way that they do not discard
useful information. We need research in the science of data reduction that can
intelligently process this raw data, and we need on-line techniques that can
process such streaming data on the fly, since we cannot afford to store first and
reduce afterwards.
Ø The second big challenge is to automatically generate the right metadata to
describe what data is recorded and how it is recorded and measured.
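One simple instance of on-line data reduction is reservoir sampling, which keeps a fixed-size uniform random sample of a stream without storing the stream first. The sketch below is illustrative only: it swaps in a generic sampling technique rather than an application-specific filter, and the name `reservoir_sample` and its parameters are assumptions, not something taken from the slides.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Reduce 100,000 readings to 5 in one pass, using constant memory.
sample = reservoir_sample(range(100_000), k=5, seed=42)
```

A sample like this cannot guarantee that rare but useful records survive, which is exactly the filter-design challenge raised above.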
3
Data Acquisition and Recording
Ø Frequently the information collected will not be in a format ready for analysis.
Ø We require an information extraction process that pulls out the required
information from the underlying sources and expresses it in a standard form
suitable for analysis.
Ø Doing this correctly and completely is a continuing technical challenge. Note that
this data also includes images and will in the future include video. Such extraction
is often highly application dependent.
Ø Existing work on data cleaning assumes well-recognized constraints on valid data
or well-understood error models. But for many emerging Big Data domains these
do not exist.
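As a small, hedged illustration of expressing extracted values in a standard form, the sketch below normalizes date strings from differently formatted sources to ISO 8601. The list `KNOWN_FORMATS` and the function `to_iso_date` are hypothetical; the set of formats (and the day-first reading of slashed dates) is an assumption that a real application would have to settle for its own sources.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats; these are illustrative, not from the slides.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    # Unrecognized input is reported as missing rather than guessed at.
    return None
```

Returning `None` instead of guessing keeps the incompleteness visible to later stages of the pipeline.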
4
Information Extraction and Cleaning
Ø Data analysis is considerably more challenging than simply locating, identifying,
understanding, and citing data. For effective large-scale analysis all of this has to
happen in a completely automated manner.
Ø This requires differences in data structuring and semantics to be expressed in
forms that are computer understandable.
Ø There is a strong body of work in data integration that can provide some of the
answers. However, considerable additional work is required to achieve automated
error-free difference resolution.
Ø Even for simpler analyses that depend on only one data set, there remains an
important question of suitable database design.
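One way to make differences in structure and semantics computer understandable is to state them declaratively, as a mapping from each source to a common schema. The following is a minimal sketch under that assumption; the two sources, their field names, and the helper `to_common_schema` are all invented for illustration.

```python
from typing import Any, Callable, Dict, Tuple

# target field -> (source field, conversion into the common representation)
Mapping = Dict[str, Tuple[str, Callable[[Any], Any]]]

# Hypothetical source A reports Fahrenheit under "temp_f".
SOURCE_A: Mapping = {
    "station": ("id", str),
    "temperature_c": ("temp_f", lambda f: (float(f) - 32.0) * 5.0 / 9.0),
}
# Hypothetical source B already reports Celsius.
SOURCE_B: Mapping = {
    "station": ("station_name", str),
    "temperature_c": ("temperature_c", float),
}


def to_common_schema(record: Dict[str, Any], mapping: Mapping) -> Dict[str, Any]:
    """Rename fields and convert units according to a declarative mapping."""
    return {
        target: convert(record[source])
        for target, (source, convert) in mapping.items()
    }


merged = [
    to_common_schema({"id": 17, "temp_f": "68"}, SOURCE_A),
    to_common_schema({"station_name": "Vienna", "temperature_c": 21.5}, SOURCE_B),
]
```

Writing the mappings by hand is the easy part; deriving and validating them automatically is where the additional research effort mentioned above is needed.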
5
Data Integration, Aggregation, and Representation
Ø Methods for querying and mining Big Data are fundamentally different from
traditional statistical analysis on small samples. Big Data is often noisy, dynamic,
heterogeneous, interrelated and untrustworthy.
Ø Mining requires integrated, cleaned, trustworthy, and efficiently accessible data,
declarative query and mining interfaces, scalable mining algorithms, and Big-Data
computing environments.
Ø In the future, queries towards Big Data will be automatically generated for content
creation on websites, to populate hot-lists or recommendations, and to provide an
ad hoc analysis of the value of a data set to decide whether to store or to discard it.
Ø Scaling complex query processing techniques to terabytes while enabling
interactive response times is a major open research problem today.
Ø Another problem with current Big Data analysis is the lack of coordination
between database systems, which host the data and provide SQL querying, and
analytics packages, which perform various forms of non-SQL processing such as
data mining and statistical analyses.
6
Query Processing, Data Modelling, and Analysis
Ø It is rarely enough to provide just the results. Rather, one must provide
supplementary information, called the provenance of the (result) data, that
explains how each result was derived and from precisely which inputs.
Ø Systems with a rich palette of visualizations become important in conveying to the
users the results of the queries in a way that is best understood in the particular
domain.
Ø Furthermore, with a few clicks the user should be able to drill down into each
piece of data that she sees and understand its provenance, which is key to
understanding the data.
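A minimal sketch of carrying provenance alongside a result is shown below. The record identifiers, the `Derived` class, and `mean_with_provenance` are hypothetical names chosen for illustration; real provenance systems record far more (versions, transformations, timestamps).

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Derived:
    value: float              # the result shown to the user
    operation: str            # how it was derived
    inputs: Tuple[str, ...]   # identifiers of the records it was derived from


def mean_with_provenance(records: Sequence[Tuple[str, float]]) -> Derived:
    """Average the values and remember which input records contributed."""
    ids = tuple(record_id for record_id, _ in records)
    values = [value for _, value in records]
    return Derived(sum(values) / len(values), "mean", ids)


result = mean_with_provenance([("r1", 10.0), ("r2", 20.0), ("r3", 30.0)])
```

With the input identifiers attached, a drill-down view can resolve `result.inputs` back to the underlying records.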
7
Interpretation
Having described the five stages in the Big Data analysis pipeline, we now turn to
some common challenges that underlie many, and sometimes all, of these stages.
These are:
Ø Heterogeneity and incompleteness
Ø Scale
Ø Timeliness
Ø Privacy
Ø Human collaboration
8
Common Challenges
Ø When humans consume information, a great deal of heterogeneity is comfortably
tolerated. In fact, the nuance and richness of natural language can provide valuable
depth. However, machine analysis algorithms expect homogeneous data, and
cannot understand nuance. In consequence, data must be carefully structured as a
first step in or prior to data analysis.
Ø Even after data cleaning and error correction, some incompleteness and some
errors in data are likely to remain. This incompleteness and these errors must be
managed during data analysis.
Ø Doing this correctly is a challenge. Recent work on managing probabilistic data
suggests one way to make progress.
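To give a flavour of what managing probabilistic data can mean, the sketch below attaches an existence probability to each record and answers queries in expectation rather than with a single certain value. It assumes records are independent of one another, which is a strong simplification, and the function names and sample rows are invented for illustration.

```python
from typing import Iterable, Tuple

Row = Tuple[str, float]  # (record id, probability that the record is correct)


def expected_count(rows: Iterable[Row]) -> float:
    """Expected number of correct records: the sum of their probabilities."""
    return sum(p for _, p in rows)


def prob_at_least_one(rows: Iterable[Row]) -> float:
    """Probability that at least one record is correct, assuming independence."""
    prob_none = 1.0
    for _, p in rows:
        prob_none *= 1.0 - p
    return 1.0 - prob_none


rows = [("alice", 0.9), ("bob", 0.5), ("carol", 0.2)]
count = expected_count(rows)       # 0.9 + 0.5 + 0.2
any_match = prob_at_least_one(rows)  # 1 - (0.1 * 0.5 * 0.8)
```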
9
Heterogeneity and Incompleteness
Ø  Managing large and rapidly increasing volumes of data has been a challenging issue for
many decades. In the past, this challenge was mitigated by processors getting faster.
Ø  Over the last five years, processor technology has made a dramatic shift. Rather than
processors doubling their clock frequency every 18-24 months, clock speeds have now largely
stalled due to power constraints, and processors are being built with increasing
numbers of cores.
Ø  In the past, large data processing systems had to worry about parallelism across nodes in a
cluster. Now, one has to deal with parallelism within a single node. This requires us to
rethink how we design, build and operate data processing components.
Ø  Another dramatic shift that is underway is the move towards cloud computing, which now
aggregates multiple disparate workloads with varying performance goals into very large
clusters.
Ø  The level of sharing of resources on expensive and large clusters requires new ways of
determining how to run and execute data processing jobs and to deal with system failures.
10
Scale
Ø The design of a system that effectively deals with size is likely also to result in a
system that can process a given size of data set faster. However, it is not just this
speed that is usually meant when one speaks of Velocity in the context of Big
Data. Rather, there is an acquisition rate challenge and a timeliness challenge.
Ø There are many situations in which the result of the analysis is required
immediately. For example, if a fraudulent credit card transaction is suspected, it
should ideally be flagged before the transaction is completed, potentially
preventing the transaction from taking place at all.
Ø Obviously, a full analysis of a user’s purchase history is not likely to be feasible in
real-time. Rather, we need to develop partial results in advance so that a small
amount of incremental computation with new data can be used to arrive at a quick
determination.
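The idea of keeping partial results so that each new transaction needs only a small incremental computation can be sketched with running statistics (Welford's update). This is an illustrative stand-in, not a fraud-detection method from the slides; the class `SpendingProfile`, the three-standard-deviation threshold, and the sample amounts are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class SpendingProfile:
    """Per-user summary maintained in advance; O(1) work per transaction."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, amount: float) -> None:
        # Welford's incremental update of mean and squared deviations.
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

    def is_suspicious(self, amount: float, k: float = 3.0) -> bool:
        # Too little history to judge: do not flag (an assumed policy).
        if self.count < 2:
            return False
        std = math.sqrt(self.m2 / (self.count - 1))
        return amount > self.mean + k * std


profile = SpendingProfile()
for past_amount in [20.0, 22.0, 19.0, 21.0, 20.0]:
    profile.update(past_amount)

flag_large = profile.is_suspicious(500.0)
flag_normal = profile.is_suspicious(21.0)
```

The check reads only three stored numbers, so it can run before the transaction completes; the full purchase history is never rescanned.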
11
Timeliness
Ø The privacy of data is another huge concern, and one that increases in the context
of Big Data.
Ø Managing privacy effectively is both a technical and a sociological problem,
which must be addressed jointly from both perspectives to realize the promise of Big
Data.
Ø It is important to rethink security for information sharing in Big Data use cases.
Many online services today require us to share private information, but beyond
record-level access control we do not understand what it means to share data, how
the shared data can be linked, and how to give users fine-grained control over this
sharing.
12
Privacy
Ø  Ideally, analytics for Big Data will be designed to have a human in the loop. The new field
of visual analytics is attempting to do this, at least with respect to the modelling and
analysis phase in the pipeline.
Ø  A popular new method of harnessing human ingenuity to solve problems is through
crowd-sourcing. Wikipedia, the on-line encyclopedia, is perhaps the best known example of
crowd-sourced data.
Ø  We are relying upon information provided by unvetted strangers, some of which will be
erroneous. While most such errors will be detected and corrected by others in the crowd,
we need technologies to facilitate this.
Ø  The issues of uncertainty and error become even more pronounced in a specific type of
crowd-sourcing, termed participatory-sensing. In this case, every person with a mobile
phone can act as a multi-modal sensor collecting various types of data instantaneously.
Ø  The extra challenge here is the inherent uncertainty of the data collection devices. The fact
that collected data are probably spatially and temporally correlated can be exploited to
better assess their correctness.
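As a rough sketch of exploiting spatial correlation, a reading can be compared with the median of nearby readings and flagged when it deviates too much. Everything here is an assumption made for illustration: the function `is_plausible`, the planar coordinates, the radius, and the tolerance. Temporal correlation is left out to keep the example short.

```python
import math
from statistics import median
from typing import Sequence, Tuple

Reading = Tuple[float, float, float]  # (x, y, measured value)


def is_plausible(reading: Reading, others: Sequence[Reading],
                 radius: float, tolerance: float) -> bool:
    """Check a reading against the median of readings within `radius`."""
    x, y, value = reading
    nearby = [v for (ox, oy, v) in others
              if math.dist((x, y), (ox, oy)) <= radius]
    if not nearby:
        # No nearby evidence either way: do not flag (an assumed policy).
        return True
    return abs(value - median(nearby)) <= tolerance


others = [(0.0, 0.0, 20.5), (0.1, 0.0, 21.0), (0.0, 0.1, 20.0), (5.0, 5.0, 35.0)]
ok = is_plausible((0.05, 0.05, 20.8), others, radius=1.0, tolerance=2.0)
outlier = is_plausible((0.05, 0.05, 35.0), others, radius=1.0, tolerance=2.0)
```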
13
Human Collaboration
Ø With Big Data, the use of separate systems becomes prohibitively expensive,
given the large size of the data sets. The expense is due not only to the cost of the
systems themselves, but also the time to load the data into multiple systems.
Ø Big Data has made it necessary to run heterogeneous workloads on a single
infrastructure that is sufficiently flexible to handle all these workloads.
Ø The challenge here is not to build a system that is ideally suited for all processing
tasks. Instead, the need is for the underlying system architecture to be flexible
enough that the components built on top of it, for expressing the various kinds of
processing tasks, can tune it to efficiently run these different workloads.
14
System Architecture
Ø Through better analysis of the large volumes of data, there is the potential for
making faster advances in many scientific disciplines and improving the
profitability and success of many enterprises.
Ø But many technical challenges must be addressed before this potential can be
realized fully.
Ø The challenges include not just the obvious issues of scale, but also heterogeneity,
lack of structure, error-handling, privacy, timeliness, provenance, and
visualization, at all stages of the analysis pipeline from data acquisition to result
interpretation.
Ø These challenges will require transformative solutions, and will not be addressed
naturally by the next generation of industrial products.
Ø Hence, we must support and encourage fundamental research towards addressing
these technical challenges if we are to achieve the promised benefits of Big Data.
15
Closing Remarks
Thank you for your attention
16

 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 

Challenges of Big Data Research

  • 1. Challenges of Big Data Research Manfred M. Fischer Vienna University of Economics and Business Regional Science Academy, Special Academic Session, ERSA conference, Tue 23 August 2016
  • 2. The Big Data Problem may be characterized in terms of three dimensions of data management challenges: Ø Volume refers to the size of the data. Ø Velocity addresses the speed at which data can be received as well as analyzed. This also refers to the rate of change of data, which is especially relevant in the area of stream processing. Ø Variety refers to the issue of disparate and incompatible data formats. Data can come in from many different sources and take on many different forms, including text, audio, video, graph, and more.
  • 3. The Big Data Analysis Pipeline (figure: Overall System). Major Steps in Analysis of Big Data: Data Acquisition and Recording → Information Extraction and Cleaning → Data Integration, Aggregation, and Representation → Query Processing, Data Modelling, and Analysis → Interpretation. Cross-Cutting Challenges: Heterogeneity and Incompleteness; Scale; Timeliness; Privacy; Human Collaboration. Caption: Major steps in analysis of Big Data are shown in the flow at the top. Below it are Big Data needs that make these tasks challenging.
  • 4. Data Acquisition and Recording: Ø Big Data does not arise out of a vacuum, but is recorded from some data generating source. Much of this data is of no interest, and can be filtered and compressed by orders of magnitude. Ø One challenge is to define these filters in such a way that they do not discard useful information. We need research in the science of data reduction that can intelligently process this raw data, and we need on-line techniques that can process such streaming data on the fly, since we cannot afford to store first and reduce afterwards. Ø The second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured.
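The idea of reducing streaming data on the fly, rather than storing first and reducing afterwards, can be illustrated with a minimal sketch. It is not taken from the talk: the function name `summarize_windows`, the `WindowSummary` record and the fixed window size are all assumptions made for illustration. Each window of raw readings is collapsed into a count, minimum, mean and maximum in a single pass, so the raw values never need to be kept.

```python
from typing import Iterable, Iterator, NamedTuple


class WindowSummary(NamedTuple):
    """Compact stand-in for one window of raw readings (hypothetical record)."""
    count: int
    minimum: float
    mean: float
    maximum: float


def summarize_windows(readings: Iterable[float],
                      window_size: int) -> Iterator[WindowSummary]:
    """Reduce a stream to one summary per window, in a single pass.

    Only a handful of running values are held in memory, so the stream
    can be arbitrarily long and is never stored in full.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")

    count = 0
    total = 0.0
    lo = hi = 0.0
    for x in readings:
        if count == 0:
            lo = hi = x          # first reading of a new window
        else:
            lo = min(lo, x)
            hi = max(hi, x)
        total += x
        count += 1
        if count == window_size:
            yield WindowSummary(count, lo, total / count, hi)
            count = 0
            total = 0.0
    if count:                    # emit the final, possibly shorter, window
        yield WindowSummary(count, lo, total / count, hi)
```

With a window size of 1000, for example, this would shrink the stored volume by roughly three orders of magnitude, at the price of discarding everything the four summary values do not capture. Choosing which summaries to keep is exactly the filter-design problem the slide describes.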
  • 5. Information Extraction and Cleaning: Ø Frequently the information collected will not be in a format ready for analysis. Ø We require an information extraction process that pulls out the required information from the underlying sources and expresses it in a standard form suitable for analysis. Ø Doing this correctly and completely is a continuing technical challenge. Note that this data also includes images and will in the future include video. Such extraction is often highly application dependent. Ø Existing work on data cleaning assumes well-recognized constraints on valid data or well-understood error models. But for many emerging Big Data domains these do not exist.
  • 6. Data Integration, Aggregation, and Representation: Ø Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis all of this has to happen in a completely automated manner. Ø This requires differences in data structuring and semantics to be expressed in forms that are computer understandable. Ø There is a strong body of work in data integration that can provide some of the answers. However, considerable additional work is required to achieve automated error-free difference resolution. Ø Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design.
  • 7. Query Processing, Data Modelling, and Analysis: Ø Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples. Big Data is often noisy, dynamic, heterogeneous, interrelated and untrustworthy. Ø Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and Big-Data computing environments. Ø In the future, queries towards Big Data will be automatically generated for content creation on websites, to populate hot-lists or recommendations, and to provide an ad hoc analysis of the value of a data set to decide whether to store or to discard it. Ø Scaling complex query processing techniques to terabytes while enabling interactive response times is a major open research problem today. Ø Another problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, and analytics packages, which perform various forms of non-SQL processing such as data mining and statistical analyses.
  • 8. Interpretation: Ø It is rarely enough to provide just the results. Rather, one must provide supplementary information [called the provenance of the (result) data] that explains how each result was derived, and based upon precisely what inputs. Ø Systems with a rich palette of visualizations become important in conveying to the users the results of the queries in a way that is best understood in the particular domain. Ø Furthermore, with a few clicks the user should be able to drill down into each piece of data that she sees and understand its provenance, which is a key feature to understanding the data.
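Provenance, in the sense used above, can be sketched as a value that carries a record of the operation that produced it and the inputs it was derived from. The following is a toy illustration only, not a description of any particular provenance system; the names `Traced`, `source`, `derive` and `lineage` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Traced:
    """A value together with how it was derived (hypothetical structure)."""
    value: float
    operation: str = "source"
    inputs: Tuple["Traced", ...] = ()


def source(name: str, value: float) -> Traced:
    """Wrap a raw input so that later results can point back to it."""
    return Traced(value, f"source:{name}")


def derive(operation: str, func: Callable[..., float], *inputs: Traced) -> Traced:
    """Apply func to the input values and remember the inputs as provenance."""
    return Traced(func(*(item.value for item in inputs)), operation, inputs)


def lineage(item: Traced, depth: int = 0) -> List[str]:
    """Drill down from a result to its inputs, one indented line per step."""
    lines = ["  " * depth + f"{item.operation} = {item.value}"]
    for parent in item.inputs:
        lines.extend(lineage(parent, depth + 1))
    return lines
```

Calling `lineage` on a derived result walks back through every intermediate step to the raw sources, which is the kind of drill-down the slide asks for. A real system would also have to record versions, timestamps and transformation parameters.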
  • 9. Common Challenges: Having described the five stages of the Big Data analysis pipeline, we now turn to some common challenges that underlie many, and sometimes all, of these stages. These are: Ø Heterogeneity and incompleteness Ø Scale Ø Timeliness Ø Privacy Ø Human collaboration
  • 10. Heterogeneity and Incompleteness: Ø When humans consume information, a great deal of heterogeneity is comfortably tolerated. In fact, the nuance and richness of natural language can provide valuable depth. However, machine analysis algorithms expect homogeneous data, and cannot understand nuance. In consequence, data must be carefully structured as a first step in or prior to data analysis. Ø Even after data cleaning and error correction, some incompleteness and some errors in data are likely to remain. This incompleteness and these errors must be managed during data analysis. Ø Doing this correctly is a challenge. Recent work on managing probabilistic data suggests one way to make progress.
  • 11. Scale: Ø Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. In the past, this challenge was mitigated by processors getting faster. Ø Over the last five years, processor technology has made a dramatic shift. Rather than processors doubling their clock cycle frequency every 18-24 months, clock speeds have now largely stalled due to power constraints, and processors are being built with increasing numbers of cores. Ø In the past, large data processing systems had to worry about parallelism across nodes in a cluster. Now, one has to deal with parallelism within a single node. This requires us to rethink how we design, build and operate data processing components. Ø Another dramatic shift that is underway is the move towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals into very large clusters. Ø The level of sharing of resources on expensive and large clusters requires new ways of determining how to run and execute data processing jobs and to deal with system failures.
  • 12. Timeliness: Ø The design of a system that effectively deals with size is likely also to result in a system that can process a given size of data set faster. However, it is not just this speed that is usually meant when one speaks of Velocity in the context of Big Data. Rather, there is an acquisition rate challenge and a timeliness challenge. Ø There are many situations in which the result of the analysis is required immediately. For example, if a fraudulent credit card transaction is suspected, it should ideally be flagged before the transaction is completed, potentially preventing the transaction from taking place at all. Ø Obviously, a full analysis of a user’s purchase history is not likely to be feasible in real-time. Rather, we need to develop partial results in advance so that a small amount of incremental computation with new data can be used to arrive at a quick determination.
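The credit card example can be sketched as follows. This is a deliberately simple illustration of "partial results in advance plus a small incremental computation", not a real fraud model: the class `SpendingProfile`, the function `check_transaction`, the z-score rule and the thresholds are all assumptions. Each card keeps a running count, mean and spread of past amounts (updated with Welford's algorithm), so a new transaction is checked in constant time without rereading the purchase history.

```python
import math
from collections import defaultdict


class SpendingProfile:
    """Partial result kept per card: running count, mean and spread."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, amount: float) -> None:
        """Fold one accepted transaction into the profile (Welford update)."""
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

    def is_suspicious(self, amount: float,
                      z_threshold: float = 3.0,   # assumed cut-off
                      min_history: int = 5) -> bool:
        """Constant-time check of a new amount against the stored profile."""
        if self.count < min_history:
            return False  # too little history to judge
        std = math.sqrt(self.m2 / (self.count - 1))
        if std == 0.0:
            return amount != self.mean
        return abs(amount - self.mean) / std > z_threshold


profiles = defaultdict(SpendingProfile)  # card id -> precomputed profile


def check_transaction(card_id: str, amount: float) -> bool:
    """Return True if the transaction should be flagged before completion."""
    profile = profiles[card_id]
    suspicious = profile.is_suspicious(amount)
    if not suspicious:
        profile.update(amount)  # only accepted amounts shape the profile
    return suspicious
```

The design choice worth noting is that the expensive part (summarizing the history) happens incrementally as each transaction is accepted, so the decision at purchase time costs only a few arithmetic operations.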
  • 13. Privacy: Ø The privacy of data is another huge concern, and one that increases in the context of Big Data. Ø Managing privacy effectively is a technical as well as a sociological problem, which must be addressed jointly from both perspectives to realize the promise of Big Data. Ø It is important to rethink security for information sharing in Big Data use cases. Many online services today require us to share private information, but beyond record-level access control we do not understand what it means to share data, how the shared data can be linked, and how to give users fine-grained control over this sharing.
  • 14. Human Collaboration: Ø Ideally, analytics for Big Data will be designed to have a human in the loop. The new field of visual analytics is attempting to do this, at least with respect to the modelling and analysis phase in the pipeline. Ø A popular new method of harnessing human ingenuity to solve problems is through crowd-sourcing. Wikipedia, the on-line encyclopedia, is perhaps the best known example of crowd-sourced data. Ø We are relying upon information provided by unvetted strangers, some of which will be wrong. While most such errors will be detected and corrected by others in the crowd, we need technologies to facilitate this. Ø The issues of uncertainty and error become even more pronounced in a specific type of crowd-sourcing, termed participatory-sensing. In this case, every person with a mobile phone can act as a multi-modal sensor collecting various types of data instantaneously. Ø The extra challenge here is the inherent uncertainty of the data collection devices. The fact that collected data are probably spatially and temporally correlated can be exploited to better assess their correctness.
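One way to exploit spatial and temporal correlation when assessing participatory-sensing data is to compare each reading with the readings taken nearby at about the same time. The sketch below is an assumption-laden illustration rather than a method from the talk: the `Reading` record, the function `flag_inconsistent`, and the radius, time window and tolerance parameters are all hypothetical, and the pairwise comparison is quadratic in the number of readings.

```python
import math
from statistics import median
from typing import List, NamedTuple


class Reading(NamedTuple):
    """One crowd-sourced measurement (hypothetical record layout)."""
    x: float      # position, in arbitrary distance units
    y: float
    t: float      # timestamp, in arbitrary time units
    value: float  # the measured quantity


def flag_inconsistent(readings: List[Reading], radius: float,
                      time_window: float, tolerance: float) -> List[Reading]:
    """Flag readings that disagree with their spatio-temporal neighbours.

    A reading is flagged when it has at least two neighbours within
    `radius` and `time_window`, and differs from the median of their
    values by more than `tolerance`.
    """
    flagged = []
    for i, r in enumerate(readings):
        neighbours = [
            o.value
            for j, o in enumerate(readings)
            if j != i
            and math.hypot(o.x - r.x, o.y - r.y) <= radius
            and abs(o.t - r.t) <= time_window
        ]
        # Isolated readings cannot be cross-checked, so they are left alone.
        if len(neighbours) >= 2 and abs(r.value - median(neighbours)) > tolerance:
            flagged.append(r)
    return flagged
```

A reading with no nearby neighbours is never flagged here, which reflects a limit of the approach: correlation can only help assess correctness where the crowd is dense enough to provide overlapping observations.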
  • 15. System Architecture: Ø With Big Data, the use of separate systems becomes prohibitively expensive, given the large size of the data sets. The expense is due not only to the cost of the systems themselves, but also to the time needed to load the data into multiple systems. Ø Big Data has made it necessary to run heterogeneous workloads on a single infrastructure that is sufficiently flexible to handle all these workloads. Ø The challenge here is not to build a system that is ideally suited for all processing tasks. Instead, the need is for the underlying system architecture to be flexible enough that the components built on top of it, for expressing the various kinds of processing tasks, can tune it to efficiently run these different workloads.
  • 16. Closing Remarks: Ø Through better analysis of the large volumes of data, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises. Ø But many technical challenges must be addressed before this potential can be realized fully. Ø The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. Ø These challenges will require transformative solutions, and will not be addressed naturally by the next generation of industrial products. Ø Hence, we must support and encourage fundamental research towards addressing these technical challenges if we are to achieve the promised benefits of Big Data.
  • 17. Thank you for your attention