The document discusses the concept of "dark data", which refers to data that is collected by organizations but not analyzed or used. Some key points:
- Up to 90% of data loses value immediately or is never analyzed by organizations. Common examples of dark data include customer location data and sensor data.
- Organizations retain dark data for compliance purposes but storing it can be more expensive than the potential value. Only about 1% of organizational data is typically analyzed.
- Dark data poses risks like legal issues if it contains private information, but also opportunity costs if competitors analyze the data first. Methods to mitigate risks include ongoing data inventories, encryption, and retention policies.
- Many types of businesses could benefit from analyzing their dark data.
Data mining Course
Chapter 2: Data preparation and processing
Introduction
Domain Expert
Goal identification and Data Understanding
Data Cleaning
Missing values
Noisy Data
Inconsistent Data
Data Integration
Data Transformation
Data Reduction
Feature Selection
Sampling
Discretization
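A minimal sketch (in Python with pandas, not part of the course materials; the column names and values are invented) of three of the preparation steps listed above: imputing missing values, smoothing a noisy outlier, and discretizing a continuous attribute:

```python
import pandas as pd

# Toy dataset exhibiting the problems the chapter covers.
df = pd.DataFrame({
    "age":    [25, None, 47, 51, None, 33],
    "income": [30000, 42000, 58000, 61000, 39000, 1000000],  # last value is an outlier
})

# Missing values: impute age with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Noisy data: clip income to the 5th-95th percentile range.
lo, hi = df["income"].quantile([0.05, 0.95])
df["income"] = df["income"].clip(lo, hi)

# Discretization: bin age into three equal-width intervals.
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])
print(df)
```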
Paul Messina from Argonne presented this deck at the HPC User Forum in Santa Fe.
"The Exascale Computing Project (ECP) was established with the goals of maximizing the benefits of high-performance computing (HPC) for the United States and accelerating the development of a capable exascale computing ecosystem. Exascale refers to computing systems at least 50 times faster than the nation’s most powerful supercomputers in use today.The ECP is a collaborative effort of two U.S. Department of Energy organizations – the Office of Science (DOE-SC) and the National Nuclear Security Administration (NNSA)."
Watch the video: http://insidehpc.com/2017/04/update-exascale-computing-project-ecp/
Learn more: https://exascaleproject.org/
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Mobile devices, sensors, and GPSs are driving demand to handle big data in both batch and real time. This presentation discusses how we used complex event processing (CEP) and MapReduce-based technologies to track and process data from a soccer match as part of the annual DEBS event processing challenge. In 2013, the challenge included a data set generated by a real soccer match in which sensors were placed in the soccer ball and players’ shoes. This session will review how we used CEP to address the DEBS challenge and achieve throughput in excess of 100,000 events/sec. It will also examine how we extended the solution to conduct batch processing with business activity monitoring (BAM) using the same framework, enabling users to obtain both instant analytics as well as more detailed batch processing-based results.
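As an illustration of the kind of windowed aggregation a CEP engine performs on sensor streams like the soccer data set (a simplified generic sketch, not the DEBS solution described above; the event format and function name are invented):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms=1000):
    """Group (timestamp_ms, player_id) events into fixed tumbling windows
    and count ball touches per player within each window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, player in events:
        windows[ts // window_ms][player] += 1
    return windows

# Hypothetical sensor events: (timestamp in ms, player id).
events = [(100, "p1"), (250, "p2"), (900, "p1"), (1100, "p2")]
counts = tumbling_window_counts(events)
for window, per_player in sorted(counts.items()):
    print(window, dict(per_player))
```

A real CEP engine adds sliding windows, pattern matching, and out-of-order handling on top of this basic grouping.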
eBook: Guide to Data Center Cabling Infrastructure
In This Free 36-page eBook:
*10 Gb/s Data Center Solutions
*Best Practices for Data Center Infrastructure Design
*Comparing Copper and Fiber Options in the Data Center
*The Hidden Costs of 10 Gb/s UTP Systems
*Light it Up: Fiber Transmissions and Applications
*Cabling Infrastructure and Green Building Initiatives
About the Author:
Carrie Higbie has been involved in computing and networking for 25+ years in executive and consultant roles. She is Siemon’s Global Network Applications Manager supporting end-users and active electronics manufacturers. She publishes columns and speaks at industry events globally. Carrie is an expert on TechTarget’s SearchNetworking, SearchVoIP, and SearchDataCenters and authors columns for these and the SearchCIO and SearchMobile forums, and is on the board of advisors. She is on the BOD and a former President of the BladeSystems Alliance. She participates in IEEE, the Ethernet Alliance, and IDC Enterprise Expert Panels. She has one telecommunications patent and one pending.
Best practices to deliver data analytics to the business with Power BI (Satya Shyam K Jayanty)
Bring your data to life with Power BI visualization and insights!
With the changing landscape of Power BI features, it is essential to master configuration and deployment practices within your data platform so that you stay on par with compliance and security requirements. In this session we will move from the basics into advanced techniques across this landscape:
How to deploy Power BI?
How to implement configuration parameters and package BI features as part of an Office 365 rollout in your organisation?
What are the newest features and enhancements in the Power BI landscape?
How to manage on-premises vs. cloud connectivity?
How can you help and support the Power BI community as well?
Within the objectives of this session, cloud computing is another aspect of this technology that makes it possible to get data to the end user in a few clicks. We will review how to manage and connect on-premises data to cloud capabilities that take full advantage of data catalogue features while keeping data secure per Information Governance standards. Beyond the nuts and bolts, performance is another aspect every admin must keep up with, so we will also look at a few settings for maximizing performance and optimizing access to data as required. Gain understanding and insight into the range of tools available for your Business Intelligence needs. A showcase will demonstrate where to begin and how to proceed in the BI world.
- D BI A Consulting
consulting@dbia.uk
4 Steps to Quickly Improve PUE Through Airflow Management (Upsite Technologies)
It’s well known that cooling typically accounts for around half of a data center's total power consumption. Given this, it's imperative that cooling is optimized to achieve a low Power Usage Effectiveness (PUE). While this too may be common knowledge, the question still remains, how can this be done quickly, with all possible benefits realized, and with the fastest return on investment?
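For reference, PUE is simply total facility power divided by the power delivered to IT equipment, so cooling overhead shows up directly in the ratio; a trivial illustration (the numbers are made up):

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power.
    1.0 is the theoretical ideal (zero overhead)."""
    return total_facility_kw / it_equipment_kw

# If cooling and other overhead roughly equal the IT load, PUE is about 2.0;
# cutting cooling power through airflow management lowers the ratio directly.
print(pue(total_facility_kw=2000, it_equipment_kw=1000))
print(pue(total_facility_kw=1500, it_equipment_kw=1000))
```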
There are many factors in the data center that are driving the new data center design considerations. This slideshare discusses several of the trends in the data center and covers several solutions to implement.
Download at http://DavidHubbard.net/powerpoint - This Introduction to Business Intelligence gives an overview of how Business Intelligence fits into business strategy in general. It does not go into specific technologies; it is meant to explain Business Intelligence to those not already familiar with it.
In a world of Global Value Chains, understanding MNEs – where they are, how they operate, where they pay taxes – has never been more important. However, surprisingly little official statistics are currently available on individual MNEs.
To fill this gap the OECD has begun to develop a new database – the Analytical Database on Individual Multinationals and Affiliates (ADIMA) – using a number of open big data sources that can provide new insights on individual MNEs and their global profiles.
The presentation discusses the different aspects of Power BI like Power BI for O365, Data Discovery, Data Analysis, Data Visualization & Power Maps, Natural Language Search etc.
It's a business analytics solution presented by Netwoven at the Microsoft Power BI workshop held on Oct 30th at SVC Microsoft, Mountain View.
Location Intelligence - The Where Factor (Thomas Lejars)
Taking the location context into account when analyzing business data reveals spatial relationships, trends, dependencies, and patterns that would be undetectable in traditional enterprise applications or BI. Location is a central factor in business. Almost all business data has a location: a customer’s address, a store’s location, competitor stores, a sales territory, a delivery route, an administrative boundary, and so on. Location awareness is highly important for performance management.
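A small illustration of putting location into analysis: computing the great-circle distance between two coordinates with the haversine formula, e.g. from a store to a customer (a generic sketch; the coordinates are examples only):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points,
    using a mean Earth radius of 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# Example: distance from Paris to London, roughly 340-350 km.
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))
```

Distances like this feed directly into spatial joins such as "customers within 10 km of a store".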
Dark Data Revelation and its Potential Benefits (PromptCloud)
This presentation covers benefits, use cases, practical examples, potential issues and the approach that needs to be taken when it comes to harnessing the power of dark data (a largely untapped strategic play in the big data realm).
Master Data in the Cloud: 5 Security Fundamentals (Sarah Fane)
Your master data is essential to the smooth operation of your business. But it is also valuable to others. Master data is vulnerable to both internal and external attacks. As the future of business and data is increasingly cloud-based, we explore five fundamentals to ensure the security of your data.
Your AI and ML Projects Are Failing – Key Steps to Get Them Back on Track (Precisely)
With recent studies indicating that 80% of AI and machine learning projects fail due to data quality issues, it’s critical to think holistically about the problem. This is not a simple topic – data quality issues can occur anywhere from project start through model implementation and usage.
View this webinar on-demand, where we start with four foundational data steps to get our AI and ML projects grounded and underway, specifically:
• Framing the business problem
• Identifying the “right” data to collect and work with
• Establishing baselines of data quality through data profiling and business rules
• Assessing fitness for purpose for training and evaluating the subsequent models and algorithms
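The third step above, establishing a data-quality baseline through profiling, can be sketched as follows (illustrative only; the column names and sample data are invented):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Baseline data-quality profile per column: type, completeness,
    and distinct-value count -- a starting point for business rules."""
    return pd.DataFrame({
        "dtype":        df.dtypes.astype(str),
        "complete_pct": (df.notna().mean() * 100).round(1),
        "distinct":     df.nunique(),
    })

# Hypothetical training data with an obvious quality problem.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email":       ["a@x.com", None, "c@x.com", None],  # 50% missing
})
print(profile(df))
```

A business rule such as "email must be at least 95% complete" can then be asserted against the profile before training begins.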
What should organizations be concerned about when using Machine Learning for predictive modeling techniques? Divergence Academy and Divergence.AI are leading efforts to bring Algorithmic Accountability awareness to the masses.
Hidden security and privacy consequences around mobility (Infosec 2013) (Huntsman Security)
An overview of the security and privacy implications and risks resulting from the wider adoption of mobile devices, apps, cloud and the resultant changes to customer interaction and business processes
Regulatory Control functions, such as Operational Risk, Compliance and Audit, increasingly raise questions around the scope, management, and identification of sensitive data within distributed and mainframe application environments.
Today's security and privacy professionals know that breaches are a fact of life. Yet their organizations are often not prepared to respond when the time comes. They're "overweight" on prevention and detection, but "underweight" on response.
Based on a decade-plus caseload of actual breach investigations across a range of different organizations, this webinar will examine an amalgamated, anonymized breach situation and review a play-by-play of how the response went: the good, the bad, and the ugly. Attendees will gain hard-earned, battle-tested insight on what to do, and what to avoid, when it's their turn to respond to an incident.
Our featured speakers for this timely webinar will be:
- Don Ulsch, CEO, ZeroPoint Risk. Distinguished Fellow at the Ponemon Institute.
- Joseph DeSalvo, Managing Director, ZeroPoint Risk. Former CSO at Mylan and Iron Mountain.
- Ted Julian, Chief Marketing Officer, Co3 Systems. Serial security and compliance entrepreneur.
Data privacy awareness is on the rise. Users become more and more concerned with how online service providers collect and protect their personal information. And so should you. Discover how to balance the risks and benefits of collecting data in the age of customer centricity.
DATA PROTECTION IMPACT ASSESSMENT TEMPLATE (ODPC).docx (SteveNgigi2)
The data protection impact assessment for a cloud-based project aims to provide financial inclusion for the unbanked population through its three modules, i.e., wallet, social banking and marketplace/business hub. The primary goal is to enable individuals without access to traditional banking services to engage in financial transactions.
The processing involves the collection, storage, and utilization of personal data for various purposes, such as creating digital wallets, facilitating social banking interactions, and delivering targeted marketing content. The platform will manage user information to enable secure and seamless financial transactions.
The targeted data subjects are individuals and entities within the unbanked population who lack access to traditional financial services. These individuals include low-income earners, marginalized communities and those residing in areas with limited banking infrastructure.
The primary class of data subjects includes the unbanked population seeking financial inclusion. Within this group, there may be subcategories, such as individuals with limited financial literacy or those residing in remote areas, and any vulnerable groups, such as elderly users or minors, who are part of the targeted data subjects.
To implement data-centric security, while simultaneously empowering your business to compete and win in today’s nano-second world, you need to understand your data flows and your business needs from your data. Begin by answering some important questions:
• What does your organization need from your data in order to extract the maximum business value and gain a competitive advantage?
• What opportunities might be leveraged by improving the security posture of the data?
• What risks exist based upon your current security posture? What would the impact of a data breach be on the organization? Be specific!
• Have you clearly defined which data (both structured and unstructured) residing across your extended enterprise is most important to your business? Where is it?
• What people, processes and technology are currently employed to protect your business-sensitive information?
• Who in your organization requires access to data and for what specific purposes?
• What time constraints exist upon the organization that might affect the technical infrastructure?
• What must you do to comply with the myriad government and industry regulations relevant to your business?
Finally, ask yourself what a successful data-centric protection program should look like in your organization. What’s most appropriate for your organization?
The answers to these and other related questions would provide you with a clearer picture of your enterprise’s “data attack surface,” which in turn will provide you with a well-documented risk profile. By answering these questions and thinking holistically about where your data is, how it’s being used and by whom, you’ll be well positioned to design and implement a robust, business-enabling data-centric protection plan that is tailored to the unique requirements of your organization.
Extract the Analyzed Information from Dark Data (ijtsrd)
The world is surrounded by data, which may be structured, unstructured, or semi-structured. Every organization generates enormous data daily, yet only the tip of it is analyzed while the larger part is excluded from useful analysis. This paper focuses on a particularly unstructured and bothersome class of data, termed dark data. Dark data is not carefully analyzed, indexed, or stored, so it becomes nearly invisible to potential users and is therefore more likely to remain unutilized and eventually lost. The paper discusses how long-term use of analyzed dark data can inform possible solutions for better curation of dark data. It describes why this class of data is critical to scientific progress, some of its properties, and the technical difficulties of managing it. Many potentially useful institutional and technical solutions are under development and are presented in the last section, but these solutions are mainly conceptual and require additional research given the lack of resources. Rahul P | Ganeshan M "Extract the Analyzed Information from Dark Data" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-4, June 2020, URL: https://www.ijtsrd.com/papers/ijtsrd30842.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-processing/30842/extract-the-analyzed-information-from-dark-data/rahul-p
Cybersecurity has become an important issue for today's businesses. This presentation will review current scams and fraud, how to develop a plan to keep your business safe and secure, tips and resources.
Is Bad Data Killing Your Customer Engagement Strategy? (Marketo)
In this webinar, hear how Marketo and AmberLeaf are helping other marketing teams improve customer engagement by improving customer data. Listen in to learn:
- How your customers view your data challenges
- Who should own the data problem
- Common data pitfalls and quick wins for clean up
- What questions to ask of other internal groups
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
11. Turning Dark
• Useful data may become dark data once it becomes irrelevant because it was not processed fast enough. In "live flowing data", these are called "perishable insights".
• Examples: the geolocation of a customer, fraud detection signals
• According to IBM, about 60 percent of data loses its value immediately.
• IBM estimates that roughly 90 percent of data generated by sensors and analog-to-digital conversions never gets used.
• Not analysing data immediately and letting it go 'dark' can lead to significant losses.
• Data must not only be processed fast enough; the organization must also act on it quickly enough.
12. Turning DARK
• Organizations retain dark data for a multitude of reasons, and it is estimated that most companies analyse only 1% of their data.
• A lot of dark data is unstructured, meaning the information is in formats that are difficult to categorise, read by machine, and thus analyse.
• Often the reason businesses do not analyse their dark data is the amount of resources it would take and the difficulty of the analysis itself.
• Because storage is inexpensive, storing data is easy. However, storing and securing the data usually entails greater expense (or even risk) than the potential return.
13. Why is dark data handled the way it is?
• It is surprising, because at the time of collection companies assume the data is going to provide value. Companies invest heavily in data collection, both monetarily and otherwise, so the data should be considered important. Here are a few reasons why there is so much dark data:
14. Why is dark data handled the way it is?
1. Lopsided priorities: teams prioritize the data tied to their immediate goals and neglect the rest, such as data on how the customer arrived at the application page.
2. Disconnect among departments: data collected by one department may not be known to other departments ("this is the way we do it here").
3. Technology and tool constraints: if data collection is done by separate technologies and tools in the same organization, it may be difficult to integrate audio file contents from the call center with click data from websites.
15. Shed some light on the DARK
• Gartner defines dark data as the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).
• In an industrial context, dark data can include information gathered by sensors and telematics.
• Similar to dark matter in physics, dark data often comprises most of an organization's universe of information assets.
• Thus, organizations often retain dark data for compliance purposes only. Storing and securing data typically incurs more expense (and sometimes greater risk) than value.
16. Dark Data Example:
IP Location
• A manufacturer of soft drinks that runs a popular website might think that, of all the data it holds, only the data directly relevant to the marketing and sales of its soft-drink products has any value. While it also stores many other data points, such as the IP locations of its users, it fails to see how these "dark" data can also have value to the company.
• Yet if the data, properly cleansed to a high quality and then analysed, reveal that 7% of the website's users are accessing the service from outside the country where the company is located, even though the product is only sold directly to retailers within that country, these are in themselves valuable data, for instance to those who target ads at users of soft drinks.
• These dark data could also be seen as an opportunity to think about marketing the product elsewhere. For instance, if 40% of the users from outside the home country access the site from India, according to the IP location data, while only 4% come from the European Union, that would strongly suggest a marketing campaign within Europe has considerably less chance of success than one aimed at the Indian subcontinent.
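The arithmetic behind this example can be sketched in a few lines. This is a minimal illustration, assuming the IP-to-country lookup has already been done (a real pipeline would use a GeoIP database); the user list below is made up.

```python
# Minimal sketch: estimate the share of website users per foreign country
# from IP-derived country codes. The geolocation step is assumed done.
from collections import Counter

def country_shares(user_countries, home_country):
    """Fraction of all users per country, excluding the home country."""
    if not user_countries:
        return {}
    total = len(user_countries)
    foreign = [c for c in user_countries if c != home_country]
    return {c: n / total for c, n in Counter(foreign).items()}

# Illustrative data: 100 users, 93 domestic, 4 from India, 3 from the EU
users = ["US"] * 93 + ["IN"] * 4 + ["EU"] * 3
shares = country_shares(users, home_country="US")
print(shares)  # {'IN': 0.04, 'EU': 0.03}
```

The same aggregation scales to millions of log-in records; only the lookup step changes.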
17. Other Dark Data Examples:
type of device
• Other typical examples of dark data, which most websites store but fail to extract value from, include the type of device the Internet is accessed from (typically a smartphone, tablet or computer); the web browser being used (e.g. Chrome, Mozilla, Opera, Edge or IE, among others); and even more obscure or dark information such as the number of times users reset their password, which would be useful to a company specializing in Internet and password security.
18. Other Dark Data Examples:
Customer Feedback
• A well-known example of dark data going to waste is where a company has a feedback form that lets users comment on its website or service, but lacks the data structures that would allow that feedback to be easily analysed. The result is a failure to take on board and act on users' judgments and criticisms, whether positive or negative (both of which have value).
22. More dark data examples
• Customer Information
• Log Files
• Account Information
• Previous Employee Data
• Financial Statements
• Raw Survey Data
• Email Correspondences
• Notes or Presentations
• Old Versions of Relevant Documents
23. The dark bites
• "Maybe we can use all this data later?" This thinking explains why many organizations are reluctant to part with dark data, even when they have no plans to put it to work on their behalf, either in the near term or further down the planning horizon.
• But the dark can bite: organizations must also be aware that the dark data they possess (or, perhaps more chillingly, the dark data about them, their customers and their operations that is stored in the cloud, outside their immediate control and management) can pose risks to their continued business health and well-being.
24. Problems from the dark
• Data stored but not used costs money (the NYT reports that 90% of the energy used by data centers is wasted).
• According to Datamation, data that is stored but unused could add up to $891 billion in costs by 2020.
• The more data is stored but not used, the higher the risk, especially to privacy.
25. The risks
1. Legal and regulatory risk. If data covered by mandate or regulation, such as confidential financial information (credit card or other account data) or patient records, appears anywhere in dark data collections, its exposure could involve legal and financial liability.
2. Intelligence risk. If dark data encompasses proprietary or sensitive
information reflective of business operations, practices, competitive
advantages, important partnerships and joint ventures, and so
forth, inadvertent disclosure could adversely affect the bottom line
or compromise important business activities and relationships.
26. The risks
3. Reputation risk. Any kind of data breach reflects badly on the
organizations affected thereby. This applies as much to dark data
(especially in light of other risks) as to other kinds of breaches.
4. Opportunity costs. Given that, by definition, the organization has decided not to invest in analysing and mining its dark data, concerted efforts by third parties to exploit that data's value represent potential losses of intelligence and value based upon its contents.
5. Open-ended exposure. By definition, dark data contains information
that's either too difficult or costly to extract to be mined, or that contains
unknown (and therefore unevaluated) sources of intelligence and
exposure to loss or harm. Dark data's secrets may be very dark and
damaging indeed, but one has no way of knowing for sure.
27. Mitigating Risks Posed by Dark Data
1. Know where the dark data is: ongoing inventory and assessment.
2. Turn dark to light: drive ongoing research into new tools and technologies.
3. Understand where dark data resides, how it is stored, how it is protected, and what kinds of access controls help maintain its security.
4. No man's land: ubiquitous encryption. No dark data should be readily accessible to casual inspection, under any circumstances.
5. Don't stay in the dark too long: retention policies and safe disposal.
6. Audit dark data for security purposes.
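Steps 1 and 5 above (inventory, and retention with safe disposal) can be sketched as a walk over a data directory that flags files older than a retention window. This is a minimal stdlib sketch; the 365-day window is an illustrative assumption, not a recommended policy.

```python
# Minimal sketch of an inventory-and-retention pass: classify every file
# under a data root as "keep" or "expired" by modification time.
# The retention window is an illustrative assumption.
import os
import time

def retention_report(root, max_age_days=365, now=None):
    """Return (keep, expired) lists of file paths under `root`."""
    now = now if now is not None else time.time()
    cutoff = now - max_age_days * 86400
    keep, expired = [], []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            (keep if os.path.getmtime(path) >= cutoff else expired).append(path)
    return keep, expired
```

A real deployment would add ownership, sensitivity classification, and an audit log before any disposal step.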
28. What are some other major areas in which dark
data is being underutilized, besides customer information?
• Education and healthcare.
• The potential to serve students and patients the way consumer and financial services pursue their target populations is huge.
• So much paperwork is involved in both education and healthcare, so the data is there; and in the age of electronic-health-record government incentives, much of it in the healthcare space is now digital.
• However, it needs to be mined and analyzed in order to lead to opportunities that effect the change which usually results from the strategic use of personal and behavioral data.
29. What kinds of businesses can really benefit
from dark data extraction and processing?
• Any business that sells a product, service or idea (anyone who has customers) can benefit. Overlooked data points include:
• How many times a user resets their password
• The IP address when a user logs into your website/app
• The date of the last email communication to your customers
• Mobile handset type, or web browser version
• Free-text feedback on a hotel stay or recent flight
• Additional passenger or guest names on a ticket or hotel room
• These data points or features are often overlooked by marketing teams as serving any useful purpose, because of a perception that this type of information is only collected for compliance, fraud or regulatory requirements.
30. How old is too old when it comes to dark
data?
• Nothing is ever too old unless it is too old
• That said, if you’re analyzing, say, customer sentiment in social media,
you simply won’t have relevant data that predates the advent of
social channels. So in that case, dark data from before those channels
existed could be considered “too old.”
31. How can you turn dark data into active,
revenue-generating data?
• This is where data science, marketing, and business intelligence need to put their heads together to find new ways of activating dark data to provide new opportunities for the organization. While dark data can appear dull and uninteresting on the surface, there are methods to turn it into highly granular, rich customer insights.
• Here are a few key steps to get you started on the above examples:
• Log-ins to your website or mobile application: what city/country are the IP addresses from? Are you logging each location a user visits from and creating a virtual map of their travels? This is particularly compelling when creating a 360-degree view of your customer.
• Additional passenger/guest names on a reservation: not only does this give insight into the homophily of the user and fuel your social network graph of which users are centrally connected and influential, it also provides rich insight into their family and workplace. Link this data with social graphing, and you'll quickly obtain age, gender, and behavioral traits.
32. How can you turn dark data into active,
revenue-generating data?
• Mobile phone data: this simple piece of data will illuminate an array of new product and marketing opportunities, and provide an additional segmentation layer to improve marketing effectiveness. From mobile phone data it is possible to know which telco partners you should bring on board (which will activate even more opportunities), where your users are in the world, in real time whether they have recently purchased tickets with another airline, and more.
• Free-text input, such as feedback, can be passed through cognitive text-analysis tools to determine whether the general sentiment of the feedback is positive or negative. Linking the user profile to your internal database can also determine whether this user is sending mixed messages on social media compared with surveys and feedback forms. Think: airlines.
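The sentiment step above can be illustrated with a deliberately tiny lexicon-based classifier. Real deployments would use a trained model or a cognitive text-analysis service; the word lists here are assumptions for illustration only.

```python
# Toy lexicon-based sentiment sketch for free-text feedback.
# The word lists are illustrative assumptions, not a real lexicon.
POSITIVE = {"great", "friendly", "comfortable", "smooth", "helpful"}
NEGATIVE = {"delayed", "rude", "dirty", "lost", "cramped"}

def sentiment(text):
    """Classify feedback as positive, negative, or neutral."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Friendly crew and a smooth flight"))  # positive
print(sentiment("Bag was lost and staff were rude"))   # negative
```

Even this crude signal, joined to the user profile, is enough to surface the "mixed messages" comparison the slide describes.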
33. Four Ways to Use Dark Data
1. Networking machine data. As noted above, servers, firewalls, network
monitoring tools and other parts of your environment generate large
amounts of machine data related to network operations. Avoid dark
networking data by using this information to analyze network security, as
well as to monitor network activity patterns to ensure that your network
infrastructure is never under- or over-utilized.
2. Customer support logs. Most businesses maintain records of customer-support interactions that include information such as when a customer
contacted the business, which type of communication channel was used,
how long the engagement lasted and so on. Don’t make the mistake of
leaving this data in the dark, or using it only when you need to research a
customer issue. Instead, build it into your analytics workflows by
leveraging it to help understand when your customers are most likely to
contact you, what their preferred methods of contact are and so on.
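Point 2 above can be sketched as a simple aggregation over support-log records. The record fields (`hour`, `channel`) and the sample data are illustrative assumptions.

```python
# Sketch: mine support-log records for the peak contact hour and the
# preferred channel. Field names and sample data are assumptions.
from collections import Counter

records = [
    {"hour": 9, "channel": "phone"},
    {"hour": 9, "channel": "chat"},
    {"hour": 14, "channel": "chat"},
    {"hour": 9, "channel": "email"},
    {"hour": 14, "channel": "chat"},
]

peak_hour = Counter(r["hour"] for r in records).most_common(1)[0][0]
top_channel = Counter(r["channel"] for r in records).most_common(1)[0][0]
print(peak_hour, top_channel)  # 9 chat
```

In practice the same two counters would run over months of exported ticket data rather than an inline list.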
34. Four Ways to Use Dark Data
3. “Legacy” system logs. If you have mainframes or other older types of systems running in your environment, you may think there is no way to use modern analytics tools to understand them. But you can. By offloading system logs and other data from these systems into an analytics platform like Hadoop, you can make sure you are not leaving this “legacy” data in the dark.
4. Non-textual data. Most data analytics workflows are built around textual data, which is easier to ingest. You can also make use of video, audio or other non-textual files, however: you can analyze the metadata associated with them or, if appropriate, translate speech to text in order to gain more insight into the content of the data itself. The effort required may not be worth it in all cases, but the bigger point worth keeping in mind is that your non-textual data doesn't have to be dark data. There are ways to make it actionable if you need it to be.
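As a minimal sketch of point 4, file metadata can be harvested without opening the media itself. The fields chosen here are assumptions; a real archive scan would also pull duration, codec, and similar attributes from a media library.

```python
# Sketch: harvest basic metadata from a non-textual file without
# decoding its contents. Field selection is an illustrative assumption.
import datetime
import os

def describe(path):
    """Return a small metadata record for one file."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "ext": os.path.splitext(path)[1],
        "bytes": st.st_size,
        "modified": datetime.datetime.fromtimestamp(st.st_mtime).isoformat(),
    }
```

Running `describe` over a media archive yields a structured table that ordinary analytics tools can consume, even though the audio or video itself stays unprocessed.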
35. LET THERE BE LIGHT: Dark Data Analytics
• Dark analytics efforts typically focus on three dimensions:
1. Untapped data already in your possession
2. Nontraditional unstructured data
3. Data in the deep web
• To be clear, the purpose of dark analytics is not to catalog vast volumes of unstructured data. Casting a broader data net without a specific purpose in mind will likely lead to failure. Indeed, dark analytics efforts that are surgically precise in both intent and scope often deliver the greatest value. Like every analytics journey, successful efforts begin with a series of specific questions: What problem are you solving? What would we do differently if we could solve that problem? And finally, what data sources and analytics capabilities will help us answer the first two questions?
36. DeepDive
• http://deepdive.stanford.edu/quickstart
• DeepDive is a system to extract value from dark data. Like dark matter, dark data
is the great mass of data buried in text, tables, figures, and images, which lacks
structure and so is essentially unprocessable by existing software.
• DeepDive helps bring dark data to light by creating structured data (SQL tables)
from unstructured information (text documents) and integrating such data with
an existing structured database.
• DeepDive is used to extract sophisticated relationships between entities and
make inferences about facts involving those entities.
• DeepDive helps one process a wide variety of dark data and put the results into a database. With the data in a database, one can use a variety of standard tools that consume structured data, e.g. visualization tools like Tableau or analytics tools like Excel.
• http://deepdive.stanford.edu/showcase/apps
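As a toy illustration of the idea (this is not DeepDive's actual API), a regular expression plus SQLite can turn a few free-text sentences into a queryable table. The sentence pattern and schema below are assumptions made for the example.

```python
# Toy illustration of unstructured-text -> SQL-table extraction,
# in the spirit of DeepDive but NOT using DeepDive itself.
import re
import sqlite3

text = "Alice joined Acme in 2015. Bob joined Initech in 2019."
pattern = re.compile(r"(\w+) joined (\w+) in (\d{4})")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employment (person TEXT, company TEXT, year INT)")
conn.executemany("INSERT INTO employment VALUES (?, ?, ?)", pattern.findall(text))

rows = conn.execute("SELECT person, company, year FROM employment").fetchall()
print(rows)  # [('Alice', 'Acme', 2015), ('Bob', 'Initech', 2019)]
```

DeepDive replaces the single regex with learned extractors and probabilistic inference, but the end state is the same: dark text becomes rows that SQL tools can query.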
37. Lessons from the front lines
• IU HEALTH’S RX FOR MINING DARK DATA
• Retailers make it personal
• Oil Company
38. IU HEALTH’S RX FOR MINING DARK DATA
• As part of a new model of care, Indiana
University Health (IU Health) is exploring
ways to use nontraditional and unstructured
data to personalize health care for
individual patients and improve overall
health outcomes for the broader
population.
• Traditional relationships between medical
care providers and patients are often
transactional in nature, focusing on
individual visits and specific outcomes
rather than providing holistic care services
on an ongoing basis. IU Health has
determined that incorporating insights from
additional data will help build patient
loyalty and provide more useful, seamless,
and cost-efficient care.
39. IU HEALTH’S RX FOR MINING DARK DATA
• “IU Health needs a 360-degree understanding of the patients it serves in order to create the kind of care and services that will keep them in the system.”
• For example, consider the voluminous free-form notes—
both written and verbal—that physicians generate
during patient consultations.
• Deploying voice recognition, deep learning, and text
analysis capabilities to these in-hand but previously
underutilized sources could potentially add more depth
and detail to patient medical records.
• These same capabilities might also be used to analyze
audio recordings of patient conversations with IU Health
call centers to further enhance a patient’s records. Such
insights could help IU Health develop a more thorough
understanding of the patient’s needs, and better
illuminate how those patients utilize the health system’s
services.
40. IU HEALTH’S RX FOR MINING DARK DATA
• Another opportunity involves using dark data to help predict need and manage care
across populations. IU Health is examining how cognitive computing, external data, and
patient data could help identify patterns of illness, health care access, and historical
outcomes in local populations. The approaches could make it possible to incorporate
socioeconomic factors that may affect patients’ engagement with health care providers.
• “There may be a correlation between high density per living unit and disengagement
from health,” says Mark Lantzy, senior vice president and chief information officer, IU
Health. “It is promising that we can augment patient data with external data to
determine how to better engage with people about their health. We are creating the
underlying platform to uncover those correlations and are trying to create something more systemic.”
• “The destination for our journey is an improved patient experience,” he continues. “Ultimately, we want it to drive better satisfaction and engagement. More than deliver great health care to individual patients, we want to improve population health throughout Indiana as well. To be able to impact that in some way, even incrementally, would be hugely beneficial.”
41. Retailers make it personal
• Retailers almost universally recognize that digital has reshaped customer
behavior and shopping. In fact, $0.56 of every dollar spent in a store is
influenced by a digital interaction.
• Yet many retailers—particularly those with brick-and-mortar operations—
still struggle to deliver the digital experiences customers expect. Some
focus excessively on their competitors instead of their customers and rely
on the same old key performance indicators and data.
• In recent years, however, growing numbers of retailers have begun
exploring different approaches to developing digital experiences. Some are
analyzing previously dark data culled from customers’ digital lives and
using the resulting insights to develop merchandising, marketing, customer
service, and even product development strategies that offer shoppers a
targeted and individualized customer experience.
42. Retailers make it personal
• Stitch Fix, for example, is an online subscription shopping service that
uses images from social media and other sources to track emerging
fashion trends and evolving customer preferences.
• Its process begins with clients answering a detailed questionnaire about their tastes in clothing. Then, with client permission, the company’s team of 60 data scientists augments that information by scanning images on customers’ Pinterest boards and other social media sites, analyzing them, and using the resulting insights to develop a deeper understanding of each customer’s sense of style.
• Company stylists and artificial intelligence algorithms use these
profiles to select style-appropriate items of clothing to be shipped to
individual customers at regular intervals.
43. Retailers make it personal
• Meanwhile, grocery supermarket chain Kroger Co. is taking a different
approach that leverages Internet of Things and advanced analytics
techniques. As part of a pilot program, the company is embedding a
network of sensors and analytics into store shelves that can interact
with the Kroger app and a digital shopping list on a customer’s phone.
• As the customer strolls down each aisle, the system—which contains
a digital history of the customer’s purchases and product
preferences—can spotlight specially priced products the customer
may want on 4-inch displays mounted in the aisles. This pilot, which
began in late 2016 with initial testing in 14 stores, is expected to
expand in 2017.
44. GREG POWERS, VICE PRESIDENT OF TECHNOLOGY,
HALLIBURTON
• Yet the sheer volume of information that we can and do collect goes way
beyond human cognitive bandwidth. Advances in sensor science are
delivering enormous troves of both dark data and what I think of as really
dark data.
• For example, we scan rocks electromagnetically to determine their
consistency. We use nuclear magnetic resonance to perform what amounts
to an MRI on oil wells. Neutron and gamma-ray analysis measures the
electrical permittivity and conductivity of rock. Downhole spectroscopy
measures fluids. Acoustic sensors collect 1–2 terabytes of data daily.
• All of this dark data helps us better understand in-well performance. In
fact, there’s so much potential value buried in this darkness that I flip the
frame and refer to it as “bright data” that we have yet to tap.
45. GREG POWERS, VICE PRESIDENT OF TECHNOLOGY,
HALLIBURTON
• In the next phase of Halliburton’s ongoing analytics program, we want to develop
the capacity to capture, mine, and use bright data insights to become more
predictive.
• Given the nature of our operations, this will be no small task. Identical events
driven by common circumstances are rare in the oil and gas industry. We have 30
years of retrospective data, but there are an infinite number of combinations of
rock, gas, oil, and other variables that affect outcomes.
• Unfortunately, there is no overarching constitutive physics equation that can describe the right action to take for any situation encountered. Yet even if we can’t explain what we’ve seen historically, we can explore what has happened and let our refined appreciation of historic data serve as a road map to where we can go.
• In other words, we plan to correlate data to things that statistically seem to
matter and, then, use this data to develop a confidence threshold to inform how
we should approach these issues.
46. GREG POWERS, VICE PRESIDENT OF TECHNOLOGY,
HALLIBURTON
• We believe that nontraditional data holds the key to creating advanced intelligent
response capabilities to solve problems, potentially without human intervention, before
they happen.
• At the lowest level, we’ll take measurements and tell someone after the fact that
something happened. At the next level, our goal will be to recognize that something has
happened and, then, understand why it happened. The following step will use real-time
monitoring to provide in-the-moment awareness of what is taking place and why. In the
next tier, predictive tools will help us discern what’s likely to happen next. The most
extreme offering will involve automating the response—removing human intervention
from the equation entirely.
• Drilling is complicated work. To make it more autonomous and efficient, and to free
humans from mundane decision making, we need to work smarter. Our industry is facing
a looming generational change. Experienced employees will soon retire and take with
them decades of hard-won expertise and knowledge. We can’t just tell our new hires,
“Hey, go read 300 terabytes of dark data to get up to speed.” We’re going to have to rely
on new approaches for developing, managing, and sharing data-driven wisdom.
47. Where do you start?
Ask the right questions:
• Rather than attempting to discover and inventory all of the dark data
hidden within and outside your organization, work with business teams to
identify specific questions they want answered. Work to identify potential
dark analytics sources and the untapped opportunities contained therein.
• Then focus your analytics efforts on those data streams and sources that
are particularly relevant.
• For example, if marketing wants to boost sales of sports equipment in a
certain region, analytics teams can focus their efforts on real-time sales
transaction streams, inventory, and product pricing data at select stores
within the target region. They could then supplement this data with
historic unstructured data—in-store video analysis of customer foot traffic,
social sentiment, influencer behavior, or even pictures of displays or
product placement across sites—to generate more nuanced insights.
48. Look outside your organization:
• You can augment your own data with publicly available demographic,
location, and statistical information. Not only can this help your
analytics teams generate more expansive, detailed reports—it can put
insights in a more useful context.
• For example, a physician makes recommendations to an asthma
patient based on her known health history and a current examination.
By reviewing local weather data, he can also provide short-term
solutions to help her through a flare-up during pollen season. In
another example, employers might analyze data from geospatial
tools, traffic patterns, and employee turnover to determine the
extent to which employee job satisfaction levels are being adversely
impacted by commute times.
49. Augment data talent:
• Data scientists are an increasingly valuable resource, especially those who
can artfully combine deep modeling and statistical techniques with
industry or function-specific insights and creative problem framing. Going
forward, those with demonstrable expertise in a few areas will likely be in
demand.
• For example, both machine learning and deep learning require programmatic expertise: the ability to build on established patterns to determine the appropriate combination of data corpus and method to uncover reasonable, defensible insights. Likewise, visual and graphic design skills may be increasingly critical, given that visually communicating results and explaining rationales are essential for broad organizational adoption.
• Finally, traditional skills such as master data management and data
architecture will be as valuable as ever—particularly as more companies
begin laying the foundations they’ll need to meet the diverse, expansive,
and exploding data needs of tomorrow.
50. Explore advanced visualization tools:
• Not everyone in your organization will be able to digest a printout of
advanced Bayesian statistics and apply them to business practices.
• Most people need to understand the “so what” and the “why” of complex
analytical insights before they can turn insight into action. In many
situations, information can be more easily digested when presented as an
infographic, a dashboard, or another type of visual representation.
• Visual and design software packages can do more than generate eye-catching graphics such as bubble charts, word clouds, and heat maps; they can boost business intelligence by repackaging big data into smaller, more meaningful chunks, delivering value to users much faster. Additionally, the
insights (and the tools) can be made accessible across the enterprise,
beyond the IT department, and to business users at all levels, to create
more agile, cross-functional teams.
51. View it as a business-driven effort:
• It’s time to recognize analytics as an overall business strategy rather than
as an IT function. To that end, work with C-suite colleagues to garner
support for your dark analytics approach.
• Many CEOs are making data a cornerstone of overall business strategy,
which mandates more sophisticated techniques and accountability for
more deliberate handling of the underlying assets.
• By understanding your organization’s agenda and goals, you can determine
the value that must be delivered, define the questions that should be
asked, and decide how to harness available data to generate answers.
• Data analytics then becomes an insight-driven advantage in the
marketplace. The best way to help ensure buy-in is to first pilot a project
that will demonstrate the tangible ROI that can be realized by the
organization with a businesswide analytics strategy.
52. Think broadly:
• As you develop new capabilities and strategies, think about how you
can extend them across the organization as well as to customers,
vendors, and business partners. Your new data strategy becomes part
of your reference architecture that others can use.