Amir Sedighi
February 2017
Dark Data
Risks and Opportunities
@amirsedighi
Speaker
Amir Sedighi
Software Engineer

Data Solutions Architect
Founder at recommender.ir
twitter: @amirsedighi
Data Era
By even the most conservative estimates, the amount
of data in the world doubles every two years.
Maybe a Venn Diagram Helps Us!
Big Data (Structured/Unstructured)
Dark Data (Almost Unstructured)
Tabular/Relational/RDBMS Data (Structured)
Almost impossible to process or analyze
Dark Data Definition by Gartner
Gartner defines dark data as the information assets
organizations collect, process and store during
regular business activities, but generally fail to use
for other purposes (for example, analytics, business
relationships and direct monetizing).
Similar to dark matter in physics, dark data often
comprises most organizations’ universe of
information assets.
Thus, organizations often retain dark data for
compliance purposes only. Storing and securing
data typically incurs more expense (and sometimes
greater risk) than value.
Dark Data - A More Sensible Definition
Organizations generate and gather data.
A large portion of the collected data are
never even analyzed!
90% of the data are never analyzed.
• Customer Information
• Log Files
• Former Employee Information
• Old Webpages
• Sensor Data
• Email Correspondence
• Account Information
• Notes or Presentations
• Old Versions of Relevant Documents
80% to 90% is Dark Data
Does Your Org have any Dark Data?
I am just going to
check if we have
any dark data in
the cellar…
Bringing Dark Data into Light
1. Gathering
2. Storing/Processing
3. Analyzing and bringing it into decisions
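The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample log lines, the `logs` table, and the function names `gather`, `store`, `analyze`, and `run_pipeline` are all hypothetical and not taken from the talk. An in-memory SQLite database stands in for whatever storage an organization actually uses.

```python
import sqlite3
from collections import Counter

# Hypothetical raw log lines standing in for data nobody has looked at yet.
RAW_LOGS = [
    "2017-02-01 10:00:01 ERROR payment timeout",
    "2017-02-01 10:00:05 INFO user login",
    "2017-02-01 10:01:12 ERROR payment timeout",
    "2017-02-01 10:02:40 WARN disk usage high",
]


def gather(lines):
    """Step 1: parse raw lines into (date, time, level, message) tuples."""
    records = []
    for line in lines:
        date, time, level, message = line.split(" ", 3)
        records.append((date, time, level, message))
    return records


def store(records, conn):
    """Step 2: persist the parsed records so they can be queried later."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs "
        "(date TEXT, time TEXT, level TEXT, message TEXT)"
    )
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", records)
    conn.commit()


def analyze(conn):
    """Step 3: count occurrences of each (level, message) pair."""
    rows = conn.execute("SELECT level, message FROM logs").fetchall()
    return Counter(rows)


def run_pipeline(lines):
    """Run all three steps against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        store(gather(lines), conn)
        return analyze(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for (level, message), count in run_pipeline(RAW_LOGS).most_common():
        print(level, message, count)
```

The most frequent pairs come first in the printed output, which is the kind of summary that can feed a decision (for example, which recurring error to investigate).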
Question
All companies know data is going to provide value.
Why is there so much dark data?
Why is there so much dark data?
• Lack of insight about data
• Lack of ambition to improve
• Disconnect among departments
• Lopsided priorities
• Lack of technology to capture and store it
• Lack of resources/infrastructure to make it available
• Lack of computing power and techniques to analyze the data
The issues you face with Dark Data
• Legal and Regulatory Issues
• Loss of Reputation
• Intelligence Risk
• Operational Costs
• Opportunity Costs
Some essential questions
• What can we gather?
• What may we extract from it?
• How may we prune it?
• How long should we keep it?
• What are the storage options?
• What are the processing options?
• What is the approximate value of each block of data?
• Running limited boundary scenarios
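As one possible starting point for the pruning and retention questions above, the following sketch lists files that have not been modified for a given number of days and totals their size. It assumes that "untouched files on disk" is a useful first proxy for dark data, which is a simplification. The function names `find_stale_files` and `summarize` and the age threshold are hypothetical.

```python
import os
import time


def find_stale_files(root, max_age_days, now=None):
    """Return (path, size_bytes) for files not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_mtime < cutoff:
                stale.append((path, info.st_size))
    return stale


def summarize(stale):
    """Total count and bytes, a rough input to the keep-or-prune decision."""
    return {"files": len(stale), "bytes": sum(size for _, size in stale)}


if __name__ == "__main__":
    # Example: report files under the current directory untouched for a year.
    print(summarize(find_stale_files(".", max_age_days=365)))
```

The byte total gives a rough handle on storage cost, while the file list is what someone would review before deciding what to keep.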
Software Tools & Frameworks for Dark Data
Log Management
Indexing and Search
Data Streaming
Machine Learning and Graph Processing
• Mahout
• MLlib
• FlinkML
• Theano
• Torch
• TensorFlow
• GraphX
• Gelly
A Common Pipeline
Machine Learning
Stream Processing
Query
Already Processed Data
Real-World Real-Time Events
New Pipeline
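The diagram itself is not recoverable from the slide text, so the sketch below is only one plausible reading of its labels: a query is answered by combining already processed data with counts built up from real-time events. The class name `Pipeline` and its methods are hypothetical, and the machine learning stage is left out for brevity.

```python
from collections import Counter


class Pipeline:
    """Toy model: batch results plus a live view, merged at query time."""

    def __init__(self, processed_counts):
        # Counts derived earlier from already processed (historical) data.
        self.batch_view = Counter(processed_counts)
        # Counts accumulated from real-time events since the last batch run.
        self.realtime_view = Counter()

    def on_event(self, key):
        """Stream processing step: fold one real-time event into the live view."""
        self.realtime_view[key] += 1

    def query(self, key):
        """Answer a query from both views; unknown keys count as zero."""
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    pipeline = Pipeline({"login": 10})
    pipeline.on_event("login")
    pipeline.on_event("error")
    print(pipeline.query("login"), pipeline.query("error"))
```

Keeping the two views separate means the historical counts can be recomputed in bulk without losing events that arrived in the meantime.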
Questions?
Keep in touch:
twitter: @amirsedighi
References
1. http://www.gartner.com/it-glossary/dark-data/
2. http://www.itproportal.com/2016/03/07/5-benefits-of-putting-dark-data-to-work/
3. http://www.kdnuggets.com/2015/11/importance-dark-data-big-data-world.html
4. https://www.youtube.com/watch?v=_fBMmQo-Z4E
5. http://confluent.io
6. https://www.ecmconnection.com/doc/the-various-shades-of-dark-data-0001
7. https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/
