This document summarizes discussions from a workshop organized by the National Institute of Standards and Technology's (NIST) Big Data Public Working Group. The workshop included four panels that discussed: 1) the current state of big data technologies; 2) future trends in big data hardware, computing models, analytics and measurement; 3) methods for improving big data sharing and collaboration; and 4) security and privacy concerns with big data. The panels featured presentations on topics such as big data reference architectures, use cases, benchmarks, data consistency issues, and approaches for enabling secure big data applications while preserving privacy.
Challenges of Big Data to Big Data Mining with Their Processing Framework (KamleshKumar394)
This document summarizes a paper presentation on the challenges of big data mining and processing frameworks. It discusses big data characteristics like volume, variety and velocity. It outlines data challenges including volume, variety, velocity, variability, value, veracity and visualization. Process challenges involving data acquisition, cleaning, analysis, integration and querying are also summarized. Management challenges involving privacy, security, data sharing and ownership are covered. Finally, a big data mining processing framework involving the HACE theorem is presented.
Data Mining Algorithm and New HRDSD Theory for Big Data (KamleshKumar394)
This document summarizes Kamlesh Kumar Pandey's presentation on data mining algorithms and a new HRDSD theory for big data. It begins by introducing big data and its characteristics, including the 3Vs, 5Vs, and 7Vs models. It then compares traditional data and big data in terms of volume, variety, velocity, and related dimensions. Common data mining algorithms for big data, such as classification, clustering, association rule learning, and regression, are discussed. The document proposes a new HRDSD theory to define big data based on high volume, relationships, distributed sources, streaming data, and storage. The author believes this theory can help in designing big data mining frameworks and algorithms. Finally, it lists 20 references related to big data, data mining, and the HRDSD theory.
This document discusses big data, including its characteristics of volume, velocity, and variety. It outlines issues related to big data such as storage and processing challenges due to the massive size of datasets. Privacy, security, and access are also concerns. Advantages include better understanding of customers, business optimization, improved science and healthcare. Effectively addressing the technical and analytical challenges will help realize big data's value.
Big Data Analytics in Information Technology (technakama)
Big data refers to data that is large in volume, variety, and velocity, making it difficult to process using traditional methods. The rise of real-time data has driven growth in big data analytics. Analytics processes involve data loading, cleansing, analysis, and reporting to help address challenges in data management and data-driven decision making. Tools like Hadoop, MapReduce, and NoSQL technologies help store and process big data, while visualization, machine learning, and predictive analytics help analyze large, varied data. Big data analytics can transform complex problems into simple solutions and provide benefits across industries through automated reports and data-driven customization.
This document discusses big data mining. It defines big data as large volumes of structured and unstructured data that are difficult to process using traditional methods due to their size. It describes the characteristics of big data including volume, variety, velocity, variability, and complexity. It also discusses challenges of big data such as data location, volume, hardware resources, and privacy. Popular tools for big data mining include Hadoop, Apache S4, Storm, Apache Mahout, and MOA. Hadoop is an open source software framework that allows distributed processing of large datasets across clusters of computers. Common algorithms for big data mining operate at the model and knowledge levels to discover patterns and correlations across distributed data sources.
This document discusses dimensionality reduction techniques for data mining. It introduces principal component analysis (PCA) as an unsupervised linear algorithm that reduces the dimensionality of a dataset while retaining most of its information. PCA finds new variables, or principal components, that are smaller in number than the original variables. It provides a geometric and algebraic description of PCA. The document also describes a proposed data mining system architecture to examine a university course database. The architecture includes components for data warehousing, online analytical processing tools, and a graphical interface.
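The PCA procedure summarized above (find a smaller set of components that retain most of a dataset's variance) can be sketched in a few lines. This is a minimal illustration using NumPy, not the system described in the document; the data, sizes, and function names are made up for the example:

```python
import numpy as np

# Minimal PCA sketch: center the data, take the eigenvectors of the
# covariance matrix, and project onto the top-k principal components.
def pca(X, k):
    X_centered = X - X.mean(axis=0)            # center each variable
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort components by variance, descending
    components = eigvecs[:, order[:k]]         # keep the top-k principal components
    return X_centered @ components             # reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 samples, 5 original variables
X_reduced = pca(X, k=2)
print(X_reduced.shape)                         # (100, 2): fewer variables, most variance kept
```

The reduced matrix has the same number of rows (samples) but fewer columns, which is exactly the "new variables, smaller in number than the original" idea described above.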
The document discusses big data issues and challenges. It defines big data as large volumes of structured and unstructured data that is growing exponentially due to increased data generation. Some key challenges discussed include storage and processing limitations of exabytes of data, privacy and security risks, and the need for new skills and training to manage and analyze big data. Examples are given of large data projects in various domains like science, healthcare, and commerce that are driving big data growth.
Data has become an indispensable part of every economy, industry, organization, business function, and individual. Big Data is a term used to identify datasets whose size is beyond the ability of typical database software tools to store, manage, and analyze. Big Data introduces unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper presents a literature review of big data mining and its issues and challenges, with emphasis on the distinguishing features of Big Data. It also discusses some methods for dealing with big data.
Data Mining and Big Data Challenges and Research Opportunities (Kathirvel Ayyaswamy)
The document discusses 10 challenging problems in data mining research. It summarizes each problem with 1-2 paragraphs explaining the challenges. Some of the key problems discussed include developing a unifying theory of data mining, scaling up for high dimensional and streaming data, mining complex relationships from interconnected data, ensuring privacy and security of data, and dealing with non-static and unbalanced data. The document advocates that more research is needed to address these issues and better integrate data mining with database systems and domain knowledge.
Big data is everywhere, although we may not immediately realize it. Most of us do not deal with large amounts of data in our daily lives except in unusual circumstances. Lacking this immediate experience, we often fail to understand both the opportunities and the challenges presented by big data. A number of issues and challenges remain in addressing these characteristics going forward.
A Comprehensive Study of the Big Data Environment and its Challenges (ijceronline)
Big Data is a data analysis methodology enabled by recent advances in technologies and architecture. Big data is a massive volume of both structured and unstructured data, so large that it is difficult to process with traditional database and software techniques. This paper provides insight into big data and discusses its nature and definition, which includes such features as Volume, Velocity, and Variety. The paper also covers the sources of big data generation, the tools available for processing large volumes of varied data, applications of big data, and the challenges involved in handling big data.
This document defines big data and discusses its key characteristics and applications. It begins by defining big data as large volumes of structured, semi-structured, and unstructured data that is difficult to process using traditional methods. It then outlines the 5 Vs of big data: volume, velocity, variety, veracity, and variability. The document also discusses Hadoop as an open-source framework for distributed storage and processing of big data, and lists several applications of big data across various industries. Finally, it discusses both the risks and benefits of working with big data.
This document summarizes a literature review paper on big data analytics. It begins by defining big data as large datasets that are difficult to handle with traditional tools due to their size, variety, and velocity. It then discusses how big data analytics applies advanced analytics techniques to big data to extract valuable insights. The paper reviews literature on big data analytics tools and methods for storage, management, and analysis of big data. It also discusses opportunities that big data analytics provides for decision making in various domains.
Data Mining With Big Data presents an overview of data mining techniques for large and complex datasets. It discusses how big data is produced and its characteristics including volume, velocity, variety, and variability. The document outlines challenges of big data mining such as platform and algorithm design, and solutions like distributed computing and privacy controls. Hadoop is presented as a framework for managing big data using its distributed file system and processing capabilities. The presentation concludes that big data technologies can provide more relevant insights by analyzing large and dynamic data sources.
This document provides an introduction to data mining and big data. It defines data mining as the process of analyzing data from different perspectives to discover useful patterns and relationships. The document lists some common applications of data mining in industries like finance, insurance, and telecommunications. It also outlines the typical steps involved in data mining, including data integration, cleaning, transformation, and knowledge presentation. Big data is defined as extremely large data sets that are difficult to process using traditional tools. The rapid growth of data from sources like social media and mobile devices is driving the need for tools to handle big data's volume, velocity, and variety of data types.
Are you having doubts and questions about how to use Big Data in your organizations? The presentation here would clear some of your doubts.
Feel free to comment if you have more queries or write to us at: bigdata@xoriant.com
This is the Big Data presentation which I submitted to the college. It covers an introduction to Big Data and the types of Big Data.
This document discusses data mining with big data. It begins with an agenda that covers problem definition, objectives, literature review, algorithms, existing systems, advantages, disadvantages, big data characteristics, challenges, tools, and applications. It then goes on to define the problem, objectives, provide a literature review summarizing several papers, and describe the architecture, algorithms, existing systems, HACE theorem that models big data characteristics, advantages of the proposed system, challenges, and characteristics of big data. It concludes that formalizing big data analysis processes will be important as data volumes continue increasing.
This document provides an overview of big data storage technologies and their role in the big data value chain. It identifies key insights about data storage, including that scalable storage technologies have enabled virtually unbounded data storage and advanced analytics across sectors. However, lack of standards and challenges in distributing graph-based data limit interoperability and scalability. The document also notes the social and economic impacts of big data storage in enabling a data-driven society and transforming sectors like health and media through consolidated data analysis.
This document summarizes key concepts related to big data, including the 4 Vs (volume, velocity, variety, and veracity), NoSQL databases, and the CAP theorem. It defines big data as large, diverse, and complex datasets that are difficult to process using traditional database management tools. The 4 Vs describe characteristics of big data, such as large volume, high velocity, variety of data types, and issues with data veracity. NoSQL databases are introduced as an alternative to SQL databases for big data that provide horizontal scaling and finer control over availability. Finally, the CAP theorem is discussed as relating to the consistency, availability, and partition tolerance of distributed data stores.
The document acknowledges and thanks several people who helped with the completion of a seminar report. It expresses gratitude to the seminar guide for being supportive and compassionate during the preparation of the report. It also thanks friends who contributed to the preparation and refinement of the seminar. Finally, it acknowledges profound gratitude to the Almighty for making the completion of the report possible with their blessings.
This document provides an overview of big data including:
- Types of data like structured and unstructured data
- Characteristics of big data and how it has evolved with more unstructured data sources
- Sectors that benefit from big data including government, banking, telecommunications, marketing, and health and life sciences
- Advantages such as understanding customers, optimizing business processes, and improving research, healthcare, and security
- Challenges including privacy, data access, analytical challenges, and human resource needs
- The conclusion states big data generates productivity and opportunities but challenges must be addressed through talent and analytics
Data Quality - The True Big Data Challenge (Stefan Kühn)
The document discusses data quality challenges, especially with big data. It notes that data quality starts at data creation and production, and that both data producers and consumers play a role. With big data, quality issues like redundancy, lack of resolution, and noise are exacerbated due to diverse sources of data, lack of documentation and standards, and increasing volumes of data. The document recommends treating data as a product and implementing quality standards, detection of problems, and root cause analysis to improve quality rather than just collecting more raw data. A shared responsibility approach between business and IT is needed to develop a common understanding of data.
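The "detect problems rather than just collect more raw data" recommendation above can be made concrete with a small sketch. The checks, field names, and record layout here are hypothetical, chosen only to illustrate flagging redundancy and missing values at ingestion time:

```python
# Hypothetical quality checks on a batch of records: flag duplicate ids
# (redundancy) and missing required fields instead of silently storing them.
REQUIRED_FIELDS = {"id", "name"}

def quality_report(records):
    seen, duplicates, incomplete = set(), [], []
    for record in records:
        key = record.get("id")
        if key in seen:
            duplicates.append(key)     # redundancy: the same id seen twice
        seen.add(key)
        present = {k for k, v in record.items() if v is not None}
        missing = REQUIRED_FIELDS - present
        if missing:
            incomplete.append((key, sorted(missing)))
    return {"duplicates": duplicates, "incomplete": incomplete}

report = quality_report([
    {"id": 1, "name": "Alice"},
    {"id": 1, "name": "Bob"},          # duplicate id
    {"id": 2, "name": None},           # missing required value
])
print(report)  # {'duplicates': [1], 'incomplete': [(2, ['name'])]}
```

Producing such a report at data creation time, rather than during analysis, matches the document's point that quality is a shared responsibility starting with the data producer.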
Slide 2: Etymology: The etymology of the term ‘Big Data’ can be traced back to the mid-1990s, when it was first used by John Mashey to refer to handling and analysis of massive datasets. However, by 2013, ‘Big Data’ was already being declared obsolescent as a meaningful term by some, as it was too wide ranging and vague in definition (e.g. de Goes, 2013).
Slide 6: Vagaries: Kitchin argues that it is velocity and these additional key characteristics that set Big Data apart and make them a “disruptive innovation – one that radically changes the nature of data and what can be done with them” (Kitchin, 2014). However, there is no one characteristic profile that all Big Data fit, and they can take multiple forms.
Slide 8: Ethics: Several ethical questions have been raised about the scope of data being generated and retained; such as those concerning privacy, informed consent, and protection from harm.
These questions raise wider issues about what kinds of data should be combined and analysed, and the purposes to which the resulting information should be put.
Slide 9: Inequalities: Challenges of inequality have also been posed:
Whose data traces will be analysed? It is likely that only those who are better off will be represented (as they are more likely to use social media, etc.)
Access and use of open data is unlikely to be equally available to everyone due to existing structural inequalities (Eynon, 2013)
Slide 11: What do Big Data actually tell us? Eynon (2013) argues that Big Data is concerned with capturing and examining patterns, and tells us more about what people actually do than about what they say they do. However, this is not sufficient for all kinds of social science research. We need to understand the meanings of behaviours which cannot be inferred simply from tracking specific patterns.
In order that Big Data are used appropriately, we need to ensure understanding of what kinds of research can or cannot be carried out using them. Big Data should not be seen as [a] “technical fix” for research, but should be used to empower, support and facilitate practice and critical research.
The document discusses the challenges of big data research. It outlines three dimensions of data challenges: volume, velocity, and variety. It then describes the major steps in big data analysis and the cross-cutting challenges of heterogeneity, incompleteness, scale, timeliness, privacy, and human collaboration. Overall, the document argues that realizing the full potential of big data will require addressing significant technical challenges across the entire data analysis pipeline from data acquisition to interpretation.
This document discusses data mining with big data. It defines data mining as the process of discovering patterns in large data sets and big data as collections of data that are too large to process using traditional software tools. The document notes that 2.5 quintillion bytes of data are created daily and that 90% of data was produced in the past two years. It provides examples of big data like presidential debates and photos. It also discusses challenges of mining big data due to its huge volume and complex, evolving relationships between data points.
Big Data Mining Using Very-Large-Scale Data Processing Platforms (IJERA Editor)
Big Data consists of large-volume, complex, growing data sets with multiple, heterogeneous sources. With the tremendous development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. MapReduce is a programming model with parallel processing ability suited to analyzing data at this scale; it allows easy development of scalable parallel applications to process big data on large clusters of commodity machines. Google's MapReduce, or its open-source equivalent Hadoop, is a powerful tool for building such applications.
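The MapReduce model described above (map emits key-value pairs, the framework groups them by key, reduce aggregates each group) can be illustrated with a toy in-memory word count. This is a single-process sketch of the programming model, not Hadoop itself, and the function names are illustrative:

```python
from collections import defaultdict

# Toy MapReduce: map emits (word, 1) pairs, the "framework" groups
# values by key, and reduce sums each group's counts.
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                          # map phase
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(                                # reduce phase, one call per key
        reduce_fn(key, values) for key, values in groups.items()
    )

result = run_mapreduce(["big data big", "data mining"])
print(result)  # {'big': 2, 'data': 2, 'mining': 1}
```

In a real cluster, the map and reduce phases run in parallel across many commodity machines and the grouping step is a distributed shuffle, which is what gives the model its scalability.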
Real World Application of Big Data in Data Mining Tools (ijsrd.com)
The main aim of this paper is to study the notion of Big Data and its application in data mining tools such as R, Weka, RapidMiner, KNIME, and Mahout. We are awash in a flood of data today. In a broad range of application areas, data is being collected at unmatched scale. Decisions that previously were based on surmise, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences. The paper mainly focuses on different types of data mining tools and their usage in big data for knowledge discovery.
ISSUES, CHALLENGES, AND SOLUTIONS: BIG DATA MINING (cscpconf)
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
The document discusses 10 challenging problems in data mining research. It summarizes each problem with 1-2 paragraphs explaining the challenges. Some of the key problems discussed include developing a unifying theory of data mining, scaling up for high dimensional and streaming data, mining complex relationships from interconnected data, ensuring privacy and security of data, and dealing with non-static and unbalanced data. The document advocates that more research is needed to address these issues and better integrate data mining with database systems and domain knowledge.
We are good IEEE java projects development center in Chennai and Pondicherry. We guided advanced java technologies projects of cloud computing, data mining, Secure Computing, Networking, Parallel & Distributed Systems, Mobile Computing and Service Computing (Web Service).
For More Details:
http://jpinfotech.org/final-year-ieee-projects/2014-ieee-projects/java-projects/
Big data is everywhere , although sometimes we may not immediately realize it . First thing to be believed is that most of us don't deal with large amount of data in our life except in unusual circumstance. Lacking this immediate experience, we often fail to understand both opportunities as well challenges presented by big data. There are currently a number of issues and challenges in addressing these characteristics going forward.
An Comprehensive Study of Big Data Environment and its Challenges.ijceronline
Big Data is a data analysis methodology enabled by recent advances in technologies and Architecture. Big data is a massive volume of both structured and unstructured data, which is so large that it's difficult to process with traditional database and software techniques. This paper provides insight to Big data and discusses its nature, definition that include such features as Volume, Velocity, and Variety .This paper also provides insight to source of big data generation, tools available for processing large volume of variety of data, applications of big data and challenges involved in handling big data
This document defines big data and discusses its key characteristics and applications. It begins by defining big data as large volumes of structured, semi-structured, and unstructured data that is difficult to process using traditional methods. It then outlines the 5 Vs of big data: volume, velocity, variety, veracity, and variability. The document also discusses Hadoop as an open-source framework for distributed storage and processing of big data, and lists several applications of big data across various industries. Finally, it discusses both the risks and benefits of working with big data.
This document summarizes a literature review paper on big data analytics. It begins by defining big data as large datasets that are difficult to handle with traditional tools due to their size, variety, and velocity. It then discusses how big data analytics applies advanced analytics techniques to big data to extract valuable insights. The paper reviews literature on big data analytics tools and methods for storage, management, and analysis of big data. It also discusses opportunities that big data analytics provides for decision making in various domains.
Data Mining With Big Data presents an overview of data mining techniques for large and complex datasets. It discusses how big data is produced and its characteristics including volume, velocity, variety, and variability. The document outlines challenges of big data mining such as platform and algorithm design, and solutions like distributed computing and privacy controls. Hadoop is presented as a framework for managing big data using its distributed file system and processing capabilities. The presentation concludes that big data technologies can provide more relevant insights by analyzing large and dynamic data sources.
This document provides an introduction to data mining and big data. It defines data mining as the process of analyzing data from different perspectives to discover useful patterns and relationships. The document lists some common applications of data mining in industries like finance, insurance, and telecommunications. It also outlines the typical steps involved in data mining, including data integration, cleaning, transformation, and knowledge presentation. Big data is defined as extremely large data sets that are difficult to process using traditional tools. The rapid growth of data from sources like social media and mobile devices is driving the need for tools to handle big data's volume, velocity, and variety of data types.
Are you having doubts and questions about how to use Big Data in your organizations? The presentation here would clear some of your doubts.
Feel free to comment if you have more queries or write to us at: bigdata@xoriant.com
This is the BIg Data Presentation which I have submitted to the college. Big Data introduction and types of Big Data have been covered in this presentation.
This document discusses data mining with big data. It begins with an agenda that covers problem definition, objectives, literature review, algorithms, existing systems, advantages, disadvantages, big data characteristics, challenges, tools, and applications. It then goes on to define the problem, objectives, provide a literature review summarizing several papers, and describe the architecture, algorithms, existing systems, HACE theorem that models big data characteristics, advantages of the proposed system, challenges, and characteristics of big data. It concludes that formalizing big data analysis processes will be important as data volumes continue increasing.
This document provides an overview of big data storage technologies and their role in the big data value chain. It identifies key insights about data storage, including that scalable storage technologies have enabled virtually unbounded data storage and advanced analytics across sectors. However, lack of standards and challenges in distributing graph-based data limit interoperability and scalability. The document also notes the social and economic impacts of big data storage in enabling a data-driven society and transforming sectors like health and media through consolidated data analysis.
This document summarizes key concepts related to big data, including the 4 Vs (volume, velocity, variety, and veracity), NoSQL databases, and the CAP theorem. It defines big data as large, diverse, and complex datasets that are difficult to process using traditional database management tools. The 4 Vs describe characteristics of big data, such as large volume, high velocity, variety of data types, and issues with data veracity. NoSQL databases are introduced as an alternative to SQL databases for big data that provide horizontal scaling and finer control over availability. Finally, the CAP theorem is discussed as relating to the consistency, availability, and partition tolerance of distributed data stores.
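The CAP theorem trade-off mentioned above can be illustrated with a toy sketch. This is not any real database: the `ToyCluster` and `ToyReplica` classes below are invented for illustration, showing how a store that stays available during a network partition (AP) diverges, while one that insists on consistency (CP) must refuse writes.

```python
# Toy illustration (not a real database): under a network partition,
# a distributed store must choose between consistency and availability.

class ToyReplica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class ToyCluster:
    def __init__(self, mode):          # mode: "CP" or "AP" (hypothetical flag)
        self.mode = mode
        self.replicas = [ToyReplica("r1"), ToyReplica("r2")]
        self.partitioned = False

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistent but unavailable: refuse writes we cannot replicate.
                raise RuntimeError("write rejected during partition")
            # Available but inconsistent: accept on one side only.
            self.replicas[0].data[key] = value
        else:
            for r in self.replicas:    # normal case: replicate everywhere
                r.data[key] = value

cluster = ToyCluster("AP")
cluster.write("k", 1)
cluster.partitioned = True
cluster.write("k", 2)                  # accepted, but the replicas now diverge
print(cluster.replicas[0].data["k"], cluster.replicas[1].data["k"])  # 2 1
```

The same partition in "CP" mode would raise instead of accepting the write, which is the availability cost the theorem describes.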
The document acknowledges and thanks several people who helped with the completion of a seminar report. It expresses gratitude to the seminar guide for being supportive and compassionate during the preparation of the report. It also thanks friends who contributed to the preparation and refinement of the seminar. Finally, it acknowledges profound gratitude to the Almighty for making the completion of the report possible with their blessings.
This document provides an overview of big data including:
- Types of data like structured and unstructured data
- Characteristics of big data and how it has evolved with more unstructured data sources
- Sectors that benefit from big data including government, banking, telecommunications, marketing, and health and life sciences
- Advantages such as understanding customers, optimizing business processes, and improving research, healthcare, and security
- Challenges including privacy, data access, analytical challenges, and human resource needs
- The conclusion states big data generates productivity and opportunities but challenges must be addressed through talent and analytics
Data quality - The True Big Data Challenge (Stefan Kühn)
The document discusses data quality challenges, especially with big data. It notes that data quality starts at data creation and production, and that both data producers and consumers play a role. With big data, quality issues like redundancy, lack of resolution, and noise are exacerbated due to diverse sources of data, lack of documentation and standards, and increasing volumes of data. The document recommends treating data as a product and implementing quality standards, detection of problems, and root cause analysis to improve quality rather than just collecting more raw data. A shared responsibility approach between business and IT is needed to develop a common understanding of data.
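The recommended "detection of problems" step can be sketched as a small profiling pass over records. This is a minimal illustration, not the document's method; the field names (`id`, `age`) and the valid range are assumptions made up for the example.

```python
# Minimal data-quality profiling sketch (assumed record layout):
# flags missing values, exact duplicates, and out-of-range entries.
from collections import Counter

def profile(records, required=("id", "age")):
    issues = {"missing": 0, "duplicates": 0, "out_of_range": 0}
    # Count identical records; each extra copy is one duplicate.
    seen = Counter(tuple(sorted(r.items())) for r in records)
    issues["duplicates"] = sum(n - 1 for n in seen.values() if n > 1)
    for r in records:
        if any(r.get(f) is None for f in required):
            issues["missing"] += 1
        age = r.get("age")
        if age is not None and not (0 <= age <= 120):  # assumed plausibility range
            issues["out_of_range"] += 1
    return issues

rows = [
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},        # duplicate
    {"id": 2, "age": None},      # missing value
    {"id": 3, "age": 190},       # implausible
]
print(profile(rows))  # {'missing': 1, 'duplicates': 1, 'out_of_range': 1}
```

Checks like these are the automated counterpart of the root-cause analysis the document recommends: they surface problems early instead of letting them accumulate in raw collections.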
Slide 2: Etymology: The etymology of the term ‘Big Data’ can be traced back to the mid-1990s, when it was first used by John Mashey to refer to handling and analysis of massive datasets. However, by 2013, ‘Big Data’ was already being declared obsolescent as a meaningful term by some, as it was too wide ranging and vague in definition (e.g. de Goes, 2013).
Slide 6: Vagaries: Kitchin argues that it is velocity and these additional key characteristics that set Big Data apart and make them a “disruptive innovation – one that radically changes the nature of data and what can be done with them” (Kitchin, 2014). However, there is no one characteristic profile that all Big Data fit and they can take multiple forms.
Slide 8: Ethics: Several ethical questions have been raised about the scope of data being generated and retained; such as those concerning privacy, informed consent, and protection from harm.
These questions raise wider issues about what kinds of data should be combined and analysed, and the purposes to which the resulting information should be put.
Slide 9: Inequalities: Challenges of inequality have also been posed:
Whose data traces will be analysed? It is likely that only those who are better off will be represented (as they are more likely to use social media, etc.)
Access and use of open data is unlikely to be equally available to everyone due to existing structural inequalities (Eynon, 2013)
Slide 11: What do Big Data actually tell us? Eynon (2013) argues that Big Data is concerned with capturing and examining patterns, and tells us more about what people actually do than about what they say they do. However, this is not sufficient for all kinds of social science research. We need to understand the meanings of behaviours which cannot be inferred simply from tracking specific patterns.
In order that Big Data are used appropriately, we need to ensure understanding of what kinds of research can or cannot be carried out using them. Big Data should not be seen as [a] “technical fix” for research, but should be used to empower, support and facilitate practice and critical research.
The document discusses the challenges of big data research. It outlines three dimensions of data challenges: volume, velocity, and variety. It then describes the major steps in big data analysis and the cross-cutting challenges of heterogeneity, incompleteness, scale, timeliness, privacy, and human collaboration. Overall, the document argues that realizing the full potential of big data will require addressing significant technical challenges across the entire data analysis pipeline from data acquisition to interpretation.
This document discusses data mining with big data. It defines data mining as the process of discovering patterns in large data sets and big data as collections of data that are too large to process using traditional software tools. The document notes that 2.5 quintillion bytes of data are created daily and that 90% of data was produced in the past two years. It provides examples of big data like presidential debates and photos. It also discusses challenges of mining big data due to its huge volume and complex, evolving relationships between data points.
Big data Mining Using Very-Large-Scale Data Processing Platforms (IJERA Editor)
Big Data consists of large-volume, complex, growing data sets with multiple, heterogeneous sources. With the tremendous development of networking, data storage, and data-collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. The MapReduce programming model provides the parallel processing capability needed to analyze such large-scale data: it allows easy development of scalable parallel applications that process big data on large clusters of commodity machines. Google’s MapReduce, or its open-source equivalent Hadoop, is a powerful tool for building such applications.
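The map/shuffle/reduce phases that MapReduce frameworks run across a cluster can be imitated in-process for clarity. This is a single-machine sketch of the classic word-count pattern, not Hadoop itself: map emits (word, 1) pairs, shuffle groups pairs by key, and reduce sums each group.

```python
# Minimal in-process imitation of the MapReduce word-count pattern.
from collections import defaultdict

def map_phase(doc):
    # Map: emit one (word, 1) pair per token.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data on commodity machines"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```

In a real Hadoop job the map and reduce functions have the same shape, but the framework distributes the documents across machines and performs the shuffle over the network.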
Real World Application of Big Data In Data Mining Tools (ijsrd.com)
The main aim of this paper is to make a study of the notion of Big Data and its application in data mining tools like R, Weka, RapidMiner, KNIME, Mahout, etc. We are awash in a flood of data today. In a broad range of application areas, data is being collected at unmatched scale. Decisions that previously were based on surmise, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences. The paper mainly focuses on different types of data mining tools and their usage in big data knowledge discovery.
ISSUES, CHALLENGES, AND SOLUTIONS: BIG DATA MINING (cscpconf)
Data has become an indispensable part of every economy, industry, organization, business function, and individual. Big Data is a term used to identify datasets whose size is beyond the ability of typical database software tools to store, manage, and analyze. Big Data introduces unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper presents a literature review of Big Data mining and its issues and challenges, with emphasis on the distinguishing features of Big Data. It also discusses some methods for dealing with big data.
Introduction to Data Analytics and data analytics life cycle (Dr. Radhey Shyam)
The document provides an overview of data analytics and big data concepts. It discusses the characteristics of big data, including the four V's of volume, velocity, variety and veracity. It also describes different types of data like structured, semi-structured and unstructured data. The document then introduces big data platforms and tools like Hadoop, Spark and Cassandra. Finally, it discusses the need for data analytics in business, including enabling better decision making and improving efficiency.
Big data is a broad term for data sets so large or complex that tr.docx (hartrobert670)
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.
Analysis of data sets can find new correlations, to "spot business trends, prevent diseases, combat crime and so on."[1] Scientists, practitioners of media and advertising, and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[2] connectomics, complex physics simulations,[3] and biological and environmental research.[4]
Data sets grow in size in part because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.[5][6][7] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[8] as of 2012, every day 2.5 exabytes (2.5×10¹⁸ bytes) of data were created.[9] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.[10]
Work that genuinely requires big data is relatively uncommon; most analysis is of "PC-size" data that a desktop PC or notebook[11] can handle.
Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work instead requires "massively parallel software running on tens, hundreds, or even thousands of servers".[12] What is considered "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make Big Data a moving target. Thus, what is considered to be "Big" in one year will become ordinary in later years. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."[13]
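The "massively parallel software running on tens, hundreds, or even thousands of servers" idea can be scaled down to a sketch of the underlying pattern: partition the data set into chunks and process each chunk on a separate worker, then combine the partial results. The thread pool below is a stand-in for a real server cluster, and `process_chunk` is a placeholder for a heavier analysis.

```python
# Scaled-down sketch of the split-and-process pattern behind
# massively parallel analysis: partition, process in parallel, combine.
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_chunks):
    # Split data into roughly equal contiguous chunks.
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    return sum(chunk)                 # stand-in for a heavier per-chunk analysis

data = list(range(1_000_000))
chunks = partition(data, 8)
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(process_chunk, chunks))  # combine partial results
print(total == sum(data))  # True
```

On a real cluster the chunks live on different machines (as HDFS blocks, for example) and the combine step happens over the network, but the structure is the same.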
Contents
· 1 Definition
· 2 Characteristics
· 3 Architecture
· 4 Technologies
· 5 Applications
· 5.1 Government
· 5.1.1 United States of America
· 5.1.2 India
· 5.1.3 United Kingdom
· 5.2 International development
· 5.3 Manufacturing
· 5.3.1 Cyber-Physical Models
· 5.4 Media
· 5.4.1 Internet of Things (IoT)
· 5.4.2 Technology
· 5.5 Private sector
· 5.5.1 Retail
· 5.5.2 Retail Banking
· 5.5.3 Real Estate
· 5.6 Science
· 5.6.1 Science and Resear ...
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf (Dr. Radhey Shyam)
The document provides an overview of data analytics and big data concepts. It discusses the characteristics of big data, including the four V's of volume, velocity, variety and veracity. It describes different types of data like structured, semi-structured and unstructured data. The document also introduces popular big data platforms like Hadoop, Spark and Cassandra. Finally, it outlines key reasons for the need of data analytics, such as enabling better decision making and improving organizational efficiency.
The document discusses big data testing using the Hadoop platform. It describes how Hadoop, along with technologies like HDFS, MapReduce, YARN, Pig, and Spark, provides tools for efficiently storing, processing, and analyzing large volumes of structured and unstructured data distributed across clusters of machines. These technologies allow organizations to leverage big data to gain valuable insights by enabling parallel computation of massive datasets.
BIMCV, Banco de Imagen Medica de la Comunidad Valenciana (María de la Iglesia)
According to Hal Varian (an expert in microeconomics and the economics of information and, since 2002, Chief Economist at Google): "In the coming years, the most attractive job will be that of the statistician. The ability to collect data, understand it, process it, extract its value, visualize it, and communicate it will all be important skills in the coming decades. We now have free and ubiquitous data. What is still missing is the capacity to understand that data."
These practice guidelines are for those who manage big-data and big-data analytics projects or are responsible for the use of data analytics solutions. They are also intended for business leaders and program leaders who are responsible for developing agency capability in the area of big data and big data analytics.
For those agencies currently not using big data or big data analytics, this document may assist strategic planners, business teams and data analysts to consider the value of big data to the current and future programs.
This document is also of relevance to those in industry, research and academia who can work as partners with government on big data analytics projects.
Technical APS personnel who manage big data and/or do big data analytics are invited to join the Data Analytics Centre of Excellence Community of Practice to share information on technical aspects of big data and big data analytics, including achieving best practice with modelling and related requirements. To join the community, send an email to the Data Analytics Centre of Excellence
A Deep Dissertion Of Data Science Related Issues And Its Applications (Tracy Hill)
This document summarizes a paper on data science that discusses its definition, processes, applications, and open research issues. It defines data science as extracting, collecting, and analyzing data for business or technical purposes. The paper describes the typical data science process as involving data wrangling, analysis, and communication. It discusses applications of data science in areas like business analytics, prediction, and healthcare. Finally, it outlines open research issues involving integrating data science with emerging technologies like the Internet of Things, cloud computing, and quantum computing.
Big data is a prominent term which characterizes the growth and availability of data in all three formats: structured, unstructured, and semi-structured. Structured data is located in fixed fields of a record or file and is found in relational databases and spreadsheets, whereas unstructured data includes text and multimedia content. The primary objective of the big data concept is to describe extreme volumes of data sets, both structured and unstructured. It is further defined along three "V" dimensions, namely Volume, Velocity, and Variety, with two more "V"s later added: Value and Veracity. Volume denotes the size of the data, Velocity the speed of data processing, Variety the types of data, Value the business value derived, and Veracity the quality and understandability of the data. Nowadays, big data has become a distinctive and preferred research area in computer science. Many open research problems exist in big data, and good solutions have been proposed by researchers, although many new techniques and algorithms for big data analysis are still needed to obtain optimal solutions. In this paper, a detailed study of big data, including its basic concepts, history, applications, techniques, research issues, and tools, is presented.
The document discusses the course objectives and topics for CCS334 - Big Data Analytics. The course aims to teach students about big data, NoSQL databases, Hadoop, and related tools for big data management and analytics. It covers understanding big data and its characteristics, unstructured data, industry examples of big data applications, web analytics, and key tools used for big data including Hadoop, Spark, and NoSQL databases.
The document discusses the collision of big data in biomedical imaging. Specifically, it notes that population image data from millions of hardware devices and thousands of software tools creates the perfect storm for big data in computational neuroimaging and digital pathology. It provides examples of how terabytes of raw imaging data and petabytes of derived analytical results are being generated from sources like digital pathology and neuroimaging studies. Managing and analyzing this large, multi-modal medical imaging data requires scalable big data techniques and architectures.
The document discusses tools and techniques for big data analytics, including A/B testing, crowdsourcing, machine learning, and data mining. It provides an overview of the big data analysis pipeline, including data acquisition, information extraction, integration and representation, query processing and analysis, and interpretation. The document also discusses fields where big data is relevant like industry, healthcare, and research. It analyzes tools like A/B testing, machine learning, and data mining techniques in more detail.
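The A/B-testing step in that toolbox typically ends with a significance test. As a hedged sketch (the conversion numbers below are made up, and a two-proportion z-test is one common choice, not necessarily the one the document describes), comparing the conversion rates of variants A and B looks like this:

```python
# Two-proportion z-test for an A/B experiment (illustrative numbers).
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, built from erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: 120/2400 conversions for A vs 165/2400 for B.
z, p = two_proportion_z(conv_a=120, n_a=2400, conv_b=165, n_b=2400)
print(round(z, 2), p < 0.05)
```

At big-data scale the counts come from millions of users, which makes even tiny rate differences statistically significant; the practical question then shifts from significance to effect size.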
Big Data in Distributed Analytics,Cybersecurity And Digital ForensicsSherinMariamReji05
This document provides an overview of big data and its applications in distributed analytics, cyber security, and digital forensics. It discusses how big data can reduce the processing time of large volumes of data in distributed computing environments using Hadoop. Examples of big data applications include using social media, search engine, and aircraft black box data for analysis. The document also outlines the challenges of traditional systems and how distributed big data architectures help address them by allowing data to be processed across clustered computers.
This document provides an introduction to the concepts of data analytics and the data analytics lifecycle. It discusses big data in terms of the 4Vs - volume, velocity, variety and veracity. It also discusses other characteristics of big data like volatility, validity, variability and value. The document then discusses various concepts in data analytics like traditional business intelligence, data mining, statistical applications, predictive analysis, and data modeling. It explains how these concepts are used to analyze large datasets and derive value from big data. The goal of data analytics is to gain insights and a competitive advantage through analyzing large and diverse datasets.
This document provides an overview of big data, including its definition, size and growth, characteristics, analytics uses and challenges. It discusses operational vs analytical big data systems and technologies like NoSQL databases, Hadoop and MapReduce. Considerations for selecting big data technologies include whether they support online vs offline use cases, licensing models, community support, developer appeal, and enabling agility.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
context of the NIST group’s reference architecture to identify
recurring patterns thought to be specific to Big Data
applications. These patterns were further explored in light of
current Apache stack offerings. These insights will likely be
useful to prospective system designers.
D. Introducing TPCx-HS – first Industry Standard for
Benchmarking Big Data Systems – Raghunath Nambiar,
Cisco
Over the past quarter century, industry standard
benchmarks have had a significant impact on the computing
industry. Vendors use benchmark standards to illustrate
performance competitiveness for their existing products, and to
improve and monitor the performance of their products under
development. Many buyers use the results as points of
comparison when purchasing new computing systems.
Continuing on the Transaction Processing Performance
Council’s commitment to bring relevant benchmarks to
industry, the TPC announced TPCx-HS – the first standard that
provides verifiable performance, price/performance and energy
consumption metrics for Big Data systems. TPCx-HS can be
used to assess a broad range of system topologies and
implementation methodologies for Hadoop, in a technically
rigorous and directly comparable, vendor-neutral manner. And
while modeling is based on a simple application, the results are
highly relevant to Big Data hardware and software systems.
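To make the metrics concrete, the sketch below computes a simplified version of the benchmark’s headline figures: a performance number (scale factor per elapsed hour) and a price/performance ratio. The function names and the 1 TB example run are invented for illustration and should not be read as the normative TPCx-HS formulas:

```python
def hsph(scale_factor_tb: float, elapsed_seconds: float) -> float:
    """Simplified performance figure: scale factor divided by the
    elapsed time of the run expressed in hours (HSph@SF)."""
    return scale_factor_tb / (elapsed_seconds / 3600.0)

def price_performance(total_system_price: float, hsph_value: float) -> float:
    """Simplified price/performance figure: dollars per HSph@SF."""
    return total_system_price / hsph_value

# A hypothetical 1 TB run that finishes in 30 minutes on a $200k system:
perf = hsph(1.0, 1800)                   # 2.0 HSph@SF
cost = price_performance(200_000, perf)  # 100,000 $/HSph@SF
```

The actual specification additionally defines an energy metric and detailed run rules, which this sketch omits.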
III. BIG DATA FUTURE DIRECTIONS
Is volume, velocity, variety, veracity or some other facet of
Big Data most critical for planning a particular Big Data
project? Will a given deployment, even if well considered,
find itself overtaken by a superseding technology? What are
the emerging trends and technologies to be aware of? These are
questions practitioners must entertain now as new commercial
releases are transforming the capabilities of widely used Big
Data software. The Future Directions panel considers likely
Big Data trends in hardware, computing models, analytics and
measurement.
A. InfoSymbiotics/DDDAS and the Next Generation of Big
Data and Big Computing – Frederica Darema, Air Force
Office of Scientific Research
We describe the DDDAS (Dynamic Data Driven
Applications Systems), a new paradigm unifying systems
modeling and systems instrumentation. DDDAS can facilitate
new capabilities for advanced modeling/simulation and
intelligent exploitation of data of engineered, natural, and
societal multi-entity systems. Results may include improved
understanding, analysis, and optimized, autonomic
management and decision support of operational conditions of
these systems.
The key underlying concept in DDDAS is the dynamic
integration between data and computation, whereby
instrumentation data and executing models of systems become
a feedback control loop. On-line data are dynamically
incorporated into executing models of the system to improve
the accuracy or speed up the simulation, and in reverse the
executing model controls the instrumentation to selectively
target the data collection process to improve accuracy and
measurability.
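The feedback loop described above can be caricatured in a few lines of Python. Everything below (the scalar model, the gain of 0.4, the resolution bounds) is invented purely to illustrate the DDDAS concept of measurements steering the model and the model steering the instrumentation:

```python
import random

random.seed(7)

def sensor_reading(true_state, resolution):
    """Toy instrumentation: a noisy observation whose error is bounded
    by the currently requested sensor resolution."""
    return true_state + random.uniform(-resolution, resolution)

def dddas_loop(true_state=10.0, steps=40):
    """One DDDAS-style feedback loop: each measurement is assimilated
    into the executing model, and the model in turn retargets the
    instrumentation, requesting finer resolution as its error shrinks."""
    model_state, resolution = 0.0, 1.0
    for _ in range(steps):
        obs = sensor_reading(true_state, resolution)
        residual = obs - model_state
        model_state += 0.4 * residual   # on-line data improves the model
        # the model steers data collection: measure more finely near convergence
        resolution = max(0.05, min(1.0, 0.25 * abs(residual)))
    return model_state
```

After a few dozen iterations the model state tracks the true state to within the sensor’s finest resolution, mirroring the accuracy/measurability trade-off described above.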
This paradigm, unifying modeling and instrumentation, is
timely with the advent of large-scale dynamic data and large-
scale big computing. Large-scale dynamic data is the next
wave of Big Data, namely dynamic data arising from
ubiquitous sensing and control in engineered, natural, and
societal systems. Numerous heterogeneous sensors and
controllers will instrument these systems. The opportunities
and challenges at these “large-scales” relate not only to the size
of the data but the heterogeneity in data, data collection
modalities, data fidelities, and timescale -- ranging from real-
time data moving in microseconds to data at rest (archive). In
tandem with this important dimension of dynamic data is an
extended view of Big Computing, which includes a new
dimension of distributed computing; that is, the range of
computing from the high-end to computing at the sensor and
controller levels, and in particular the collections of networked
assemblies of sensors and controllers.
The DDDAS paradigm, driving and exploiting notions of
large-scale dynamic data and large-scale Big Computing, is
shaping research directions and transforming a range of
application areas. Examples of advances and new capabilities
are presented. These include analysis and decision support for
structural systems, manufacturing, environmental and critical
infrastructure (such as urban and air transportation), and power
grids.
B. NIST Roadmap and Standards – David Boyd, L-3 Data
Tactics
The NIST Big Data Interoperability Framework: Volume 7,
Technology Roadmap was prepared by the NBD-PWG’s
Technology Roadmap Subgroup. It addresses the overarching
information and context about key questions such as:
• When is data considered “Big”?
• How did Big Data evolve?
• What will it evolve to?
• How is technology developing to deal with Big Data in
terms of storage, organization, processing, and resource
management?
• What standards are needed and evolving to deal with Big
Data? and,
• How might organizations address their Big Data
challenges?
This presentation will discuss the issues of organizational
readiness, technology readiness, technology features, standards
initiatives, and strategies.
C. Big Data Analytics Interest Group (BDA IG) of Research
Data Alliance (RDA) – Kwo-Sen Kuo, Bayesics
The Big Data Analytics (BDA) Interest Group was formed
to develop community-based recommendations for viable data
analytics approaches to address scientific community needs of
efficiently utilizing large quantities of data. It supports
formation of working groups to tackle specific problems.
• BDA aims to clarify foundational terminology in the
context of data analytics, delineating the
differences/overlaps with terms like data science, data
analysis, data mining, etc.
• BDA will develop a recommendation document with a
systematic classification of feasible combinations of
analysis algorithms, analytical tools, data and resource
characteristics and scientific queries. These
recommendation documents can serve as a best practice
guide for scientific groups/communities interested in
investing in Big Data technologies.
• BDA works to develop a consensus amongst its members
to achieve this desired goal.
• BDA collaborates with external bodies and initiatives -
such as NIST, OGC, ISO, EarthServer and others.
D. Next-Generation Computing Systems for Big Data
Machine Learning and Graph Analytics – H. Howie
Huang, George Washington University
Big data machine learning and graph analytics have been
widely used in industry, academia and government.
Continuous advance in this area is critical to business success,
scientific discovery, as well as cybersecurity. In this position
paper, we present the current state of the art, and propose that
next-generation computing systems for Big Data machine
learning and graph analytics need innovative designs in both
hardware and software that provide a good match between Big
Data algorithms and the underlying computing and storage
resources.
IV. BIG DATA SHARING AND COLLABORATION
Critical to moving Big Data forward as a discipline are the
methods needed for improving both collaboration and data
sharing. We are familiar with cooperation in open source
technology development and in online courses, but how do we
cooperatively move forward and put these technologies into
practice? How do we better provision data frameworks to
promote technology adoption, data sharing and data reuse?
A. Public Private Collaboration – Johan Bos-Beier, ACT/IAC
ACT-IAC Big Data Committee seeks to enable government
agencies to make better data-driven decisions through the
analysis, management, integration, and representation of large
and complex data stores. The BDC seeks to:
• Provide a forum for information sharing and collaboration
between federal, state, and local government agencies
seeking to leverage their data for better informed decision-
making.
• Advise or recommend approaches to developing Big Data
technical frameworks and capability maturity model
assessments.
• Promote Big Data best practices through increasing
awareness of Big Data research, technologies, use cases,
and high performance computing within the Federal
Government.
B. Implementation of Big Data Applications in Government
and Science Communities – Joan L. Aron, Federal Big
Data Working Group
A conceptual overview sets the context for the uses of Big
Data for knowledge discovery and decision support and the
challenges in developing applications. The federation of use
cases, data publications, solutions, and technologies provides
examples. Semantic analysis is the basis of solutions for many
applications for government and science communities. The
federal government has greater needs for aggregating data
while maintaining compliance with privacy and security
requirements. Cognitive metadata, which is metadata
derived from enhancing machine learning with human
perception, reasoning, or intuition, can be used for
personalization purposes and conversely for protecting
personally identifiable information (PII). A new technology
for natural language understanding can be used to find high-
value information in a large body of texts, such as a collection
of agency reports, with little specialized training. Advances in
high-performance computational hardware are also important.
A semantic MEDLINE for searching biomedical research
literature uses hardware built for Resource Description
Framework (RDF) triples in a graph database and semantic
processing developed at the National Library of Medicine. A
high-performance computing cluster environment is in use for
searching public records, patent data, case law and news
articles. Use cases with a focus on environment and Earth
system science illustrate achievements and challenges for the
use of Big Data in data publishing and data access, data
discovery and decision support, and workforce development
for the scientific community and decision-makers to work with
data science.
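As a minimal illustration of the triple model underlying such systems, the toy in-memory store below answers a pattern query over a few invented biomedical facts; a production system like the semantic MEDLINE described above would use a real graph database and SPARQL rather than this sketch:

```python
# Invented example facts in (subject, predicate, object) form.
triples = {
    ("med:aspirin", "med:treats", "med:headache"),
    ("med:aspirin", "med:interactsWith", "med:warfarin"),
    ("med:ibuprofen", "med:treats", "med:headache"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard,
    much like an unbound variable in a SPARQL query."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# "What treats headache?" -- the shape of question a semantic
# literature index answers at scale.
treatments = sorted(s for s, _, _ in match(p="med:treats", o="med:headache"))
# treatments == ['med:aspirin', 'med:ibuprofen']
```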
C. Data-Intensive Science Challenges – Thomas Huang,
NASA Earth Science Data Systems Data-Intensive
Architecture Working Group
Data-Intensive Science comprises three high-level activities:
capture, curation, and analysis of data. Tackling Big Science
Data requires more than just infusing Cloud Computing,
Hadoop, and NoSQL. Science data system architecture is an
orchestration of people, process, policies, and technologies. It
requires thorough understanding of the problem space,
assessment of technologies available, process that is repeatable
and traceable, and an adaptable architecture. This session
focuses on architectural discussion and enabling technologies
for tackling data-intensive science. The discussion should be
supported by use cases as the instrument to facilitate review of
current science data systems and assessment of some of the
enabling technologies.
D. Big Data Provenance and Metadata – Rajeev Agrawal,
North Carolina A&T State University
With the progress of new technology, the volume and
complexity of data produced and processed in scientific
research is increasing remarkably. These data are growing so
fast that existing resources struggle to analyze them
properly. It is important to properly track scientific workflows
to provide context and reproducibility. Provenance deals with
this need and assists scientists by delivering the lineage or
history of the way of generating, using and modifying data. We
discuss a complete workflow of tracking provenance
information of Big Data.
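A minimal sketch of one such lineage entry, with invented field names, might capture the operation performed, its inputs, the responsible agent, and a content hash of the output so later consumers can verify what they received:

```python
import hashlib
import time

def provenance_record(operation, inputs, output_bytes, agent):
    """Minimal lineage entry: what was done, to what, by whom and when,
    plus a hash of the result for later verification."""
    return {
        "operation": operation,
        "inputs": inputs,                # ids of upstream datasets/records
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
        "agent": agent,
        "timestamp": time.time(),
    }

record = provenance_record(
    operation="normalize",
    inputs=["raw_run_042"],
    output_bytes=b"temp,pressure\n21.3,101.2\n",
    agent="pipeline-v1.3",
)
# Chaining such records reconstructs how a dataset was generated,
# used and modified, as described above.
```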
V. BIG DATA SECURITY AND PRIVACY
The distribution of data across resources, and the
involvement of a number of organizations in one system open
up new concerns for security and privacy. This panel will focus
on the areas that are new and different because of the Big Data
architectures. The panel will discuss the state of the art in
security and privacy enhancing technologies, Big Data privacy
concerns and the over-arching challenge of deriving knowledge
from Big Data while preserving privacy.
A. Big Data Analytics for Security – Pratyusa Manadhata, HP
and Computer Security Alliance
Enterprises routinely collect terabytes of security-relevant
data (e.g., network events, software application events, and
people action events) for several reasons, including the need
for regulatory compliance and post-hoc forensic analysis. We
estimate that large enterprises may generate 10-100 billion
events per day depending on their size. These numbers will
grow as enterprises enable event logging in more sources, hire
more employees, deploy more devices, and run more software.
Unfortunately, this volume of data quickly becomes
overwhelming. Existing analytical techniques do not work well
at this scale and typically produce so many false positives that
their efficacy is undermined. The problem becomes worse as
enterprises move to cloud architectures and collect much more
data. We will discuss techniques to mitigate this problem.
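To put those volumes in perspective, a quick back-of-envelope conversion (illustrative arithmetic only) turns the daily event counts into the sustained per-second rates a security analytics pipeline must absorb:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def events_per_second(events_per_day: float) -> float:
    """Convert a daily event volume into a sustained per-second rate."""
    return events_per_day / SECONDS_PER_DAY

low = events_per_second(10e9)    # ~116 thousand events/s
high = events_per_second(100e9)  # ~1.16 million events/s
```

Even the low end implies well over a hundred thousand events per second around the clock, which is why analytical techniques that emit even a small fraction of false positives become overwhelming at this scale.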
B. Cyber Security and the Industrial Internet – Stephen
Mellor, Industrial Internet Consortium
Through its public-private partnerships, the IIC is
committed to ensuring that security and privacy are
integral parts of Industrial Internet
products and services. The IIC is working with its ecosystem to
identify the requirements for communication protocols and
create mechanisms to enhance rapid discovery, mitigation, and
remediation of vulnerabilities in near real-time. This session
will be an open discussion on how the IIC is defining future
requirements and recommendations to ensure the Industrial
Internet is private and secure.
C. NIST Big Data Security and Privacy – Mark Underwood,
Krypton Brothers
The NIST Big Data Interoperability Framework Volume 4:
Security and Privacy Requirements was prepared by the NBD-
PWG’s Security and Privacy Subgroup to identify security and
privacy issues particular to Big Data. Big Data application
domains include health care, drug discovery, finance and many
others from both the private and public sectors. Among the sce-
narios within these application domains are health exchanges,
clinical trials, mergers and acquisitions, device telemetry, and
international anti-piracy. Security technology domains include
identity, authorization, audit, network and device security, and
federation across trust boundaries.
Clearly, the advent of Big Data has necessitated paradigm
shifts in the understanding and enforcement of security and
privacy (S&P) requirements. Significant changes are evolving,
notably in scaling existing solutions to meet the volume,
variety, and velocity of Big Data, and re-targeting security
solutions amid shifts in technology infrastructure, e.g., dis-
tributed computing systems and non-relational data storage. In
addition, as diverse datasets become ever-easier to access,
many are increasingly personal in nature. Thus, a whole new
set of emerging issues must be addressed, including balancing
privacy and utility, enabling analytics and governance on
encrypted data, and reconciling authentication and anonymity.
Working with other subgroups in the NBD-PWG, this
subgroup has begun to expand the distributed computing
concept of a Big Data security fabric.
With the key Big Data characteristics of variety, volume,
and velocity in mind, the subgroup gathered use cases from
volunteers, developed a consensus security and privacy taxon-
omy and reference architecture, and validated it by mapping
the use cases to the reference architecture.
D. Education Data Privacy and State Boards of Education –
Amelia Vance, National Association of State Boards of
Education
Big data has the potential to revolutionize education, al-
lowing for more efficient and effective schools. It can allow
every teacher to personalize every element of instruction, and
enable policymakers to see exactly which elements of each
educational policy are successful in helping ensure students are
college- and career-ready. However, while many technologists
believe that the benefits of Big Data in education are self-
evident and outweigh any dangers of collecting sensitive stu-
dent information, many parents, teachers, and policymakers do
not feel the same way. Only now are parents learning about the
data schools are collecting about their children. They are justly
concerned about how it is used and shared; the fact that data
collection is often outsourced to third-party vendors only adds
to their skepticism and concerns for their child's privacy. This
has led to an instinctual response by many policymakers and
others to work against the use of Big Data in education, despite
the potential benefits it may have for education. In 2014, state
legislatures introduced 110 bills in 36 states regarding student
data privacy. Seventy-nine of the 2014 bills have at least some
elements that would restrict the use of data in education. For
example, New Hampshire's bill, which was passed into law,
likely prevents predictive analytics. A bill in Missouri would
have defunded their statewide longitudinal data system. In all,
28 of the 110 bills introduced passed into law this year. And,
the number of student data privacy bills is expected to double
in the 2015 legislative session.
Many of the bills introduced, and the laws passed, give
state boards of education (SBEs) a key role in the data privacy
discussion. Eighteen SBEs are tasked by statute with writing
their state's student data management policy or have oversight
authority for the agency that is writing the policy. Thirteen
SBEs are members of their state's data management team.
Seven SBEs are required by statute to ensure FERPA
compliance. Fifty-five bills introduced in 2014 would give SBEs
some authority in regulating student data privacy. Existing
state privacy laws give many SBEs authority over various
things to help secure data privacy, including appointing a chief
privacy officer, adopting and/or implementing state privacy
policies, and providing oversight of vendor contracts. SBEs
have also independently passed rules for their states to protect
data privacy. Unfortunately, like many other policymakers,
many SBE members are unaware of the potential benefits of
Big Data in education. Education data privacy requires
knowledge of privacy law, a basic understanding of Big Data,
and a great deal of time to learn about the ins and outs of
today's education data privacy debate. The National
Association of State Boards of Education (NASBE) is helping
SBEs understand and pass effective policies on these issues
that will protect data privacy while supporting educational
innovation through the use of Big Data. In this panel, Amelia
Vance from NASBE will discuss the role SBEs play in
education data collection, the questions they are asking as they
put together state privacy policies (particularly those dealing
with third party use of data), and what information
policymakers need from technology providers in order to trust
the use of Big Data in education.
We consider the perspectives and recommendations from
multiple organizations and experts, including the Data Quality
Campaign, the Electronic Privacy Information Center, and the
Pioneer Institute, and examine the lessons learned thus
far from states' failed attempts at responsible data collection
and privacy protection.
ACKNOWLEDGMENT
The authors wish to thank the panelists for their time and
efforts to share their expertise and further the dialog for
clarifying the new discipline of Big Data. The authors also
wish to acknowledge the contributions of the large group of
participants in the NBD-PWG, who have discussed at length
the emerging discipline of Big Data, and have helped form a
collective understanding of this new paradigm.
REFERENCES
[1] N. Grady, W. Chang, eds. “NIST Big Data Interoperability Framework:
Volume 1, Definitions” NIST. unpublished.
[2] N. Grady, W. Chang, eds. “NIST Big Data Interoperability Framework:
Volume 2, Taxonomy” NIST. unpublished.
[3] G. Fox, W. Chang, eds. “NIST Big Data Interoperability Framework:
Volume 3, Use Cases and Requirements” NIST. unpublished.
[4] A. Roy, M. Underwood, W. Chang, eds. “NIST Big Data
Interoperability Framework: Volume 4, Security and Privacy
Requirements” NIST. unpublished.
[5] S. Mishra, W. Chang, eds. “NIST Big Data Interoperability Framework:
Volume 5, Architectures White Paper Survey” NIST. unpublished.
[6] O. Levin, W. Chang, eds. “NIST Big Data Interoperability Framework:
Volume 6, Reference Architecture” NIST. unpublished.
[7] D. Boyd, C. Buffington, W. Chang, eds. “NIST Big Data
Interoperability Framework: Volume 7, Technology Roadmap” NIST. unpublished.