The document is a report on unstructured data and the enterprise from Second Quarter 2012. It contains:
1) Various text formats like documents, emails, and web pages make up the largest portion of unstructured data in enterprises. Many firms are implementing projects to extract useful information from large amounts of email data.
2) Content management systems help enterprises manage and obtain information from unstructured text documents using metadata to enhance search and reporting.
3) The report discusses how capturing and managing unstructured data through techniques like metadata, taxonomy creation, and text analytics can provide competitive advantages for firms.
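As a toy illustration of points 2 and 3 above, metadata making unstructured text searchable and reportable, the following Python sketch tags documents with metadata and combines a metadata filter with a keyword search. All field names and documents are invented for the example:

```python
# Illustrative only: tagging unstructured documents with metadata
# so they can be filtered and searched. Field names are hypothetical.

documents = [
    {"type": "email", "department": "sales", "date": "2012-04-03",
     "text": "Customer asked for a renewal quote on the enterprise plan."},
    {"type": "report", "department": "finance", "date": "2012-05-10",
     "text": "Quarterly spend on cloud storage exceeded budget."},
    {"type": "email", "department": "sales", "date": "2012-06-21",
     "text": "Prospect requested a demo of the analytics dashboard."},
]

def search(docs, keyword, **metadata_filters):
    """Return documents whose metadata matches the filters
    and whose text contains the keyword (case-insensitive)."""
    results = []
    for doc in docs:
        if all(doc.get(k) == v for k, v in metadata_filters.items()):
            if keyword.lower() in doc["text"].lower():
                results.append(doc)
    return results

hits = search(documents, "demo", type="email", department="sales")
print(len(hits))  # 1
```

The metadata narrows the candidate set before the text is scanned, which is the basic way content management systems make unstructured collections reportable.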
Natural Language Understanding in Healthcare (David Talby)
The ability of software to reason, answer questions and intelligently converse about clinical notes, patient stories or biomedical papers has risen dramatically in the past few years.
This talk covers state of the art natural language processing, deep learning, and machine learning libraries in this space. We'll share benchmarks from industry & research projects on use cases such as clinical data abstraction, patient risk prediction, named entity recognition & resolution, and negation scope detection.
Large amounts of heterogeneous medical data have become available in various healthcare organizations (payers, providers, pharmaceuticals). Those data could be an enabling resource for deriving insights for improving care delivery and reducing waste. The enormity and complexity of these datasets present great challenges in analyses and subsequent applications to a practical clinical environment. More details are available here http://dmkd.cs.wayne.edu/TUTORIAL/Healthcare/
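One of the use cases named above, negation scope detection, can be sketched with a much-simplified, NegEx-style detector in plain Python. This is illustrative only and not the implementation benchmarked in the talk; production clinical NLP libraries use far richer trigger lists and trained models:

```python
# Toy NegEx-style negation scope detection for clinical text.
# Trigger and terminator lists here are tiny samples for illustration.

NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]
SCOPE_TERMINATORS = [" but ", " however ", "; "]

def _has_trigger(prefix):
    tokens = prefix.split()
    for trigger in NEGATION_TRIGGERS:
        if " " in trigger:
            if trigger in prefix:      # multi-word trigger: substring match
                return True
        elif trigger in tokens:        # single word: whole-token match
            return True
    return False

def negated_entities(sentence, entities):
    """Mark each entity negated if a trigger precedes it in its clause."""
    # A trigger's scope ends at a clause boundary ("but", "however", ";").
    clauses = [sentence.lower()]
    for term in SCOPE_TERMINATORS:
        clauses = [part for clause in clauses for part in clause.split(term)]
    results = {}
    for entity in entities:
        results[entity] = False
        for clause in clauses:
            pos = clause.find(entity.lower())
            if pos >= 0:
                results[entity] = _has_trigger(clause[:pos])
    return results

note = "Patient denies chest pain but reports shortness of breath."
print(negated_entities(note, ["chest pain", "shortness of breath"]))
# → {'chest pain': True, 'shortness of breath': False}
```

Note how "but" terminates the negation scope, so "shortness of breath" is correctly left unnegated even though "denies" appears earlier in the sentence.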
Data modelling is considered a staple in the world of data management. The skill of the data modeler and their knowledge of the business plays a large role in successful Enterprise Information Management across many organizations. Data modeling requires formal accountability, attention to metadata and getting the business heavily involved in data requirement development. These are all traits of solid Data Governance programs.
Join Bob Seiner and a special guest modeler extraordinaire in this month’s installment of Real-World Data Governance to discuss data modeling as a form of data governance. Learn how to use the skillfulness of the data modeler to advance data-as-an-asset and governance agendas while conveying the importance and value of both disciplines.
In this webinar Bob and a special guest will talk about:
• Data Modeling as Art or Science
• Role of Data Modeler in a Governance Program
• Data Modeler Skills as Governance Skills
• Modeling and Governance Best Practices
• Leveraging the Model as a Governance Artifact
Modernizing Integration with Data Virtualization (Denodo)
Watch full webinar here: https://bit.ly/3CMqS0E
Today, businesses have more data and data types, combined with more complex ecosystems, than ever before. Examples include on-premises data marts, data warehouses, data lakes, applications, spreadsheets, IoT and sensor data, and unstructured data, combined with cloud data ecosystems such as Snowflake, BigQuery, Azure Synapse, Amazon S3, Redshift, and Databricks, and SaaS apps such as Salesforce, Oracle, ServiceNow, and Workday.
Data, Analytics, Data Science, and Architecture teams are struggling to provide business users with the right data as quickly and efficiently as possible to enable analytics, dashboards, BI, reports, and more. Unfortunately, many enterprises try to meet this pressing need with antiquated, 40+-year-old legacy approaches. There is a better way, proven by thousands of other companies.
As Forrester so astutely reported in its recent Total Economic Impact Study, companies that employed Data Virtualization reported a “65% decrease in data delivery times over ETL” and an “83% reduction in time to new revenue.”
Join us for this very educational webinar to learn firsthand from Denodo Technologies and Fusion Alliance how:
- Data Virtualization helps your company save time and money by eliminating superfluous ETL pipelines and data replication.
- Data Virtualization can become the cornerstone of your modern data approach to deliver data faster and more efficiently than old legacy approaches at enterprise scale.
- Data Virtualization can scale quickly and easily, even in the most complex environments, to create a universal semantic abstraction model for all of your cloud, on-premises, structured, unstructured, and hybrid data
- Data Mesh and Data Fabric architecture patterns for maximum reuse
- Other customers have used, and are using, Data Virtualization to tackle their toughest data integration and data delivery challenges
- Fusion Alliance can help you define a data strategy tailored to your organization’s needs and requirements, and enable your business with self-service capabilities
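The core idea of Data Virtualization, querying data where it lives instead of copying it through ETL pipelines, can be sketched in a few lines. The example below federates a relational table and a SaaS-style API response behind one virtual view; it is a conceptual toy (with invented source names and data), not anything like Denodo's actual engine:

```python
import sqlite3

# Source 1: a relational table (stand-in for a data warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [("acme", 120.0), ("globex", 75.5), ("acme", 30.0)])

# Source 2: a SaaS-style API response (stand-in for, e.g., a CRM).
crm_accounts = [{"customer": "acme", "region": "EMEA"},
                {"customer": "globex", "region": "APAC"}]

def virtual_view():
    """Join both sources at query time, leaving the data in place."""
    totals = dict(db.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))
    return [{"customer": acct["customer"],
             "region": acct["region"],
             "total_spend": totals.get(acct["customer"], 0.0)}
            for acct in crm_accounts]

for row in virtual_view():
    print(row)
```

Every call to `virtual_view()` reads both sources live, so consumers always see current data with no replication step in between.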
Credit Card Fraud Detection Using ML in Databricks (Databricks)
For credit card companies, illegitimate card usage is a serious problem, creating a need to accurately distinguish fraudulent from non-fraudulent transactions. All organizations can be hugely impacted by fraud and fraudulent activities, especially those in financial services. The threat can originate internally or externally, but the effects can be devastating, including loss of consumer confidence, incarceration for those involved, and even the downfall of a corporation. Regular fraud prevention measures are constantly being put to the test by attempts to beat the system.
Fraud detection is the task of predicting whether a card transaction was made by the legitimate cardholder. One way to recognize fraudulent card usage is to leverage machine learning (ML) models. To detect fraudulent transactions more dynamically, one can train ML models on a dataset that includes credit card transaction information as well as card and demographic information about the account owner. This is the goal of the project, built on Databricks.
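As a hedged, from-scratch sketch of the approach just described (not the actual Databricks/Spark ML pipeline), the following trains a small logistic-regression classifier on synthetic transaction features; the features, ranges, and data are invented for illustration:

```python
import math
import random

random.seed(0)

def make_transaction(fraud):
    # Hypothetical features, scaled to [0, 1]: transaction amount,
    # hour of day, and whether the merchant is new for this card.
    if fraud:
        return ([random.uniform(0.7, 1.0),   # unusually large amount
                 random.uniform(0.0, 0.25),  # odd hour
                 1.0], 1)                    # new merchant
    return ([random.uniform(0.0, 0.4),
             random.uniform(0.3, 0.9),
             random.choice([0.0, 1.0])], 0)

data = [make_transaction(i % 4 == 0) for i in range(400)]

def sigmoid(z):
    if z < -60.0:
        return 0.0
    if z > 60.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-z))

# Train with plain stochastic gradient descent.
weights, bias, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(200):
    for x, y in data:
        pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        error = pred - y
        weights = [w - lr * error * xi for w, xi in zip(weights, x)]
        bias -= lr * error

def predict(x):
    """True if the model scores the transaction as likely fraudulent."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) > 0.5

accuracy = sum(predict(x) == bool(y) for x, y in data) / len(data)
print(f"training accuracy: {accuracy:.2f}")
```

In Databricks the same idea would be expressed with a distributed ML library over real transaction tables; the toy above only shows the shape of the supervised-learning setup.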
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data being generated by organizations: machine-generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
What Is Unstructured Data And Why Is It So Important To Businesses? (Bernard Marr)
Unstructured data is created at an incredible rate each day and with the advent of artificial intelligence and machine learning tools to gather, process, analyse and report insights from unstructured data, it now provides important business value to organizations. It’s essential for all businesses to start making the most of their unstructured data.
Gain insights from data analytics and take action! Learn why everyone is making a big deal about big data in healthcare and how data analytics creates action.
OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.
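The roll-up at the heart of such multidimensional analysis can be sketched in a few lines of Python; the fact table, dimensions, and figures below are invented for illustration:

```python
from collections import defaultdict

sales = [  # fact table: (region, quarter, product, revenue)
    ("EMEA", "Q1", "widgets", 100.0),
    ("EMEA", "Q2", "widgets", 140.0),
    ("APAC", "Q1", "widgets",  80.0),
    ("APAC", "Q1", "gadgets",  60.0),
    ("EMEA", "Q1", "gadgets",  90.0),
]

def roll_up(facts, *dims):
    """Sum revenue grouped by any combination of dimensions,
    where dims are indices into the fact tuple
    (0 = region, 1 = quarter, 2 = product)."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        totals[key] += fact[3]
    return dict(totals)

print(roll_up(sales, 0))     # aggregate by region
print(roll_up(sales, 0, 1))  # drill down to region and quarter
```

An OLAP engine precomputes and indexes such aggregates across many dimensions at once; the sketch only shows the grouping operation itself.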
This is a presentation I gave on Data Visualization at a General Assembly event in Singapore, on January 22, 2016. The presentation provides a brief history of dataviz as well as examples of common chart and visualization formatting mistakes that you should never make.
The right architecture is key for any IT project. This is especially the case for big data projects, where there are no standard architectures which have proven their suitability over years. This session discusses the different Big Data Architectures which have evolved over time, including traditional Big Data Architecture, Streaming Analytics architecture as well as Lambda and Kappa architecture and presents the mapping of components from both Open Source as well as the Oracle stack onto these architectures.
Below are the topics covered in this tutorial:
What is Data Visualization?
What is Tableau?
Why Tableau?
Tableau Job Trends
Companies using Tableau
Who should go for Tableau?
Tableau Architecture
Tableau Visualizations
Real time Use Case
Data visualizations make huge amounts of data more accessible and understandable. Data visualization, or "data viz," is becoming increasingly important as the amount of data generated grows and big data tools help create meaning from all of that data.
This SlideShare presentation takes you through more details around data visualization and includes examples of some great data visualization pieces.
This presentation covers the concept of Big Data:
Why Big Data is important to the present world.
How to visualize big data.
Steps for perfect visualization.
Visualization and design principles.
It also presents a number of visualization methods for big data and traditional data.
Advantages of visualization in Big Data.
If you want to learn more about graph technology, this introductory webinar is the perfect way to start exploring the potential of the relationships between your data for your business.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
As enterprises and other organizations look to sift through and make sense of all the structured and unstructured data available to them, dashboards are being positioned as the solution to their problems. However, a dashboard needs special attention to visual design, or it will fail to meet expectations. If not carefully designed, a new dashboard can leave consumers unsatisfied, frustrated, confused, and even overwhelmed.
Aaron Hursman shares his experiences designing effective dashboards that deliver on the promise of targeted, accessible, and actionable information. He discusses examples from both his personal work and contributions of thought-leaders in this space. Through these examples, he presents practical dashboard design techniques. He also presents effective approaches to take during design and construction. Finally, Aaron explains the challenges and obstacles that dashboard designers face today, and how to mitigate the risks that result.
[Originally presented to RefreshDallas, 16 Oct 2008]
An overview of clinical healthcare data analytics from the perspective of an interventional cardiology registry. This was initially presented as part of a workshop at the University of Illinois College of Computer Science on April 20, 2017.
Integrating Structure and Analytics with Unstructured Data (DATAVERSITY)
How can you make sense of messy data? How do you wrap structure around non-relational, flexibly structured data? With the growth in cloud technologies, how do you balance the need for flexibility and scale with the need for structure and analytics? Join us for an overview of the marketplace today and a review of the tools needed to get the job done.
During this hour, we'll cover:
- How big data is challenging the limits of traditional data management tools
- How to recognize when tools like MongoDB, Hadoop, IBM Cloudant, R Studio, IBM dashDB, CouchDB, and others are the right tools for the job.
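One common way to wrap structure around flexibly structured documents, flattening nested JSON into uniform rows, can be sketched as follows. This is a conceptual toy with invented documents; tools like MongoDB and Hadoop handle the same idea at scale:

```python
def flatten(doc, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

docs = [  # documents with differing shapes, as in a document store
    {"id": 1, "user": {"name": "Ada", "city": "London"}},
    {"id": 2, "user": {"name": "Lin"}, "score": 7},
]

rows = [flatten(d) for d in docs]
columns = sorted({col for row in rows for col in row})
print(columns)
print(rows[0])
```

The union of the flattened keys becomes the schema, so heterogeneous documents land in one analyzable table with gaps where a document lacks a field.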
The Business Case for Robotic Process Automation (RPA), by Joe Tawfik
This paper by Kinetic Consulting Services (www.kineticcs.com) outlines the business case for Robotic Process Automation (RPA). It examines the commercial and strategic aspects of RPA.
Introduction to Object Storage Solutions White Paper (Hitachi Vantara)
Learn more about Hitachi Content Platform Anywhere by visiting http://www.hds.com/products/file-and-content/hitachi-content-platform-anywhere.html
and more information on the Hitachi Content Platform is at http://www.hds.com/products/file-and-content/content-platform
SA2: Text Mining from User Generated Content (John Breslin)
ICWSM 2011 Tutorial
Lyle Ungar and Ronen Feldman
The proliferation of documents available on the Web and on corporate intranets is driving a new wave of text mining research and application. Earlier research addressed extraction of information from relatively small collections of well-structured documents such as newswire or scientific publications. Text mining from other corpora such as the web requires new techniques drawn from data mining, machine learning, NLP, and IR. Text mining requires preprocessing document collections (text categorization, information extraction, term extraction), storage of the intermediate representations, analysis of these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the algorithms and methods used to build text mining systems. The tutorial will cover the state of the art in this rapidly growing area of research, including recent advances in unsupervised methods for extracting facts from text and methods used for web-scale mining. We will also present several real-world applications of text mining. Special emphasis will be given to lessons learned from years of experience in developing real-world text mining systems, including recent advances in sentiment analysis and how to handle user-generated text such as blogs and user reviews.
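As a minimal sketch of one preprocessing step named above, term extraction, here is TF-IDF scoring in plain Python. The corpus is invented for the example; real text mining systems add tokenization, stemming, phrase detection, and much more:

```python
import math

corpus = [
    "the new phone has a great camera and battery",
    "the battery drains fast on the new phone",
    "the restaurant had great food and service",
]

docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def tf_idf(term, doc):
    """Term frequency in this document times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in docs)  # documents containing the term
    return tf * math.log(n_docs / df)

# Score every term of the first document: terms common to the whole
# collection (like "the") score zero; distinctive terms score highest.
scores = {term: tf_idf(term, docs[0]) for term in set(docs[0])}
print(f"camera={scores['camera']:.3f} "
      f"battery={scores['battery']:.3f} the={scores['the']:.3f}")
# → camera=0.122 battery=0.045 the=0.000
```

Terms with the highest TF-IDF are the candidates a term-extraction stage would keep as the document's representative vocabulary.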
Lyle H. Ungar is an Associate Professor of Computer and Information Science (CIS) at the University of Pennsylvania. He also holds appointments in several other departments at Penn in the Schools of Engineering and Applied Science, Business (Wharton), and Medicine. Dr. Ungar received a B.S. from Stanford University and a Ph.D. from M.I.T. He directed Penn's Executive Masters of Technology Management (EMTM) Program for a decade, and is currently Associate Director of the Penn Center for BioInformatics (PCBI). He has published over 100 articles and holds eight patents. His current research focuses on developing scalable machine learning methods for data mining and text mining.
Ronen Feldman is an Associate Professor of Information Systems at the Business School of the Hebrew University in Jerusalem. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University and his Ph.D. in Computer Science from Cornell University in NY. He is the author of the book "The Text Mining Handbook" published by Cambridge University Press in 2007.
HfS Webinar Slides: Smart Process Automation in Enterprise Business (HfS Research)
Global businesses must cut operational cost and improve agility, and Smart Process Automation - which combines workforce orchestration, RPA and cognitive automation - delivers on this imperative.
Experts from HfS Research, WorkFusion and Ascension Health discussed how to solve for business outcomes through more integrated automation technologies.
Participants will learn about:
- How SPA relates to the HfS Research Intelligent Automation Continuum
- How this new breed of automation improves on legacy solutions
- The role machine learning plays in SPA
- Use cases for SPA in shared services organizations and specific industries
- How WorkFusion’s SPA platform delivers
- A practical path forward for end users at the enterprise level who wish to explore SPA to achieve their operational mandates
View the replay here: ow.ly/RDwv301FeR3
Emotion detection from text using data mining and text mining (Sakthi Dasans)
Based on a research paper published by the Faculty of Engineering, The University of Tokushima, at IEEE 2007, we built an intelligent system under the title Emotelligence on Text to recognize human emotion from textual content.
In other words, given an input string, the system can identify the emotion behind that textual content.
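A toy, keyword-lexicon version of this idea can be sketched as follows. The lexicon and categories are invented for illustration; the actual Emotelligence system is based on the cited University of Tokushima research, not on this sketch:

```python
# Toy emotion detection: score text against small keyword lexicons.

LEXICON = {
    "joy":     {"happy", "glad", "delighted", "love", "wonderful"},
    "sadness": {"sad", "unhappy", "miss", "lonely", "cry"},
    "anger":   {"angry", "furious", "hate", "annoyed"},
}

def detect_emotion(text):
    """Return the emotion whose keywords appear most often, or 'neutral'."""
    words = text.lower().split()
    scores = {emotion: sum(w.strip(".,!?") in keywords for w in words)
              for emotion, keywords in LEXICON.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

print(detect_emotion("I am so happy and delighted today!"))  # joy
print(detect_emotion("The meeting is at noon."))             # neutral
```

Lexicon matching is the simplest baseline; the research-grade approaches layer syntax and learned models on top of it.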
This eBook outlines the various types of data and explores the future of data analytics with a particular leaning towards unstructured data, both human and machine-generated.
Big Data & Text Mining: Finding Nuggets in Mountains of Textual Data
A large amount of information is available in textual form in databases or online sources, and for many enterprise functions (marketing, maintenance, finance, etc.) it represents a huge opportunity to improve business knowledge. For example, text mining is starting to be used in marketing, more specifically in analytical customer relationship management, in order to achieve the holy grail of a 360° view of the customer (integrating elements from inbound mail, web comments, surveys, internal notes, etc.).
Approaching this new domain, I did some personal research and put together a synthesis that helped me clarify some ideas. The presentation below does not intend to be exhaustive on the subject, but may bring you some useful insights.
A successful data migration process can be used for a one-time migration, or as a standard procedure for future migrations employing a consistent, reliable and repeatable methodology incorporating planning, tool implementation and validation. Dell data migration services can help. This whitepaper explores the Dell storage portfolio, Dell methodologies for migration and the use cases for migrating customers over to Dell Storage.
Benefits of Modern Cloud Data Lake Platform Qubole GCP - Whitepaper (Vasu S)
IDC explains how data leaders are adopting cloud data lake platforms built by companies like Qubole and Google Cloud Platform to address the growing need for mission-critical analytics during COVID-19
https://www.qubole.com/resources/white-papers/benefits-of-modern-cloud-data-lake-platform-idc-qubole-gcp
White Paper: Gigya's Information Security and Data Privacy Practices (Gigya)
As the leading SaaS Customer Identity and Access Management provider for enterprises, Gigya is committed to maintaining a high level of performance and security. Our platform is optimized for maximum efficiency and scalability while protecting our clients’ data by adhering to strict security and compliance standards. This document provides an overview of Gigya’s standards for the following four categories: Infrastructure, Data Security, Compliance, and Privacy Policies.
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le... (DATAVERSITY)
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a comprehensive platform designed to address multi-faceted needs by offering multi-function data management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion.
In this research-based session, I’ll discuss what the components are in multiple modern enterprise analytics stacks (i.e., dedicated compute, storage, data integration, streaming, etc.) and focus on total cost of ownership.
A complete machine learning infrastructure cost for the first modern use case at a midsize to large enterprise will be anywhere from $3 million to $22 million. Get this data point as you take the next steps on your journey into the highest spend and return item for most companies in the next several years.
Data at the Speed of Business with Data Mastering and Governance (DATAVERSITY)
Do you ever wonder how data-driven organizations fuel analytics, improve customer experience, and accelerate business productivity? They are successful by governing and mastering data effectively so they can get trusted data to those who need it faster. Efficient data discovery, mastering, and democratization are critical for swiftly linking accurate data with business consumers. When business teams can quickly and easily locate, interpret, trust, and apply data assets to support sound business judgment, it takes less time to see value.
Join data mastering and data governance experts from Informatica—plus a real-world organization empowering trusted data for analytics—for a lively panel discussion. You’ll hear more about how a single cloud-native approach can help global businesses in any economy create more value—faster, more reliably, and with more confidence—by making data management and governance easier to implement.
What is data literacy? Which organizations, and which workers in those organizations, need to be data-literate? There are seemingly hundreds of definitions of data literacy, along with almost as many opinions about how to achieve it.
In a broader perspective, companies must consider whether data literacy is an isolated goal or one component of a broader learning strategy to address skill deficits. How does data literacy compare to other types of skills or “literacy” such as business acumen?
This session will position data literacy in the context of other worker skills as a framework for understanding how and where it fits and how to advocate for its importance.
Building a Data Strategy – Practical Steps for Aligning with Business Goals (DATAVERSITY)
Developing a Data Strategy for your organization can seem like a daunting task – but it’s worth the effort. Getting your Data Strategy right can provide significant value, as data drives many of the key initiatives in today’s marketplace – from digital transformation, to marketing, to customer centricity, to population health, and more. This webinar will help demystify Data Strategy and its relationship to Data Architecture and will provide concrete, practical ways to get started.
Uncover how your business can save money and find new revenue streams.
Driving profitability is a top priority for companies globally, especially in uncertain economic times. It's imperative that companies reimagine growth strategies and improve process efficiencies to help cut costs and drive revenue – but how?
By leveraging data-driven strategies layered with artificial intelligence, companies can achieve untapped potential and help their businesses save money and drive profitability.
In this webinar, you'll learn:
- How your company can leverage data and AI to reduce spending and costs
- Ways you can monetize data and AI and uncover new growth strategies
- How different companies have implemented these strategies to achieve cost optimization benefits
Data Catalogs Are the Answer – What Is the Question? (DATAVERSITY)
Organizations with governed metadata made available through their data catalog can answer questions their people have about the organization’s data. These organizations get more value from their data, protect their data better, gain improved ROI from data-centric projects and programs, and have more confidence in their most strategic data.
Join Bob Seiner for this lively webinar where he will talk about the value of a data catalog and how to build the use of the catalog into your stewards’ daily routines. Bob will share how the tool must be positioned for success and viewed as a must-have resource that is a steppingstone and catalyst to governed data across the organization.
In this webinar, Bob will focus on:
-Selecting the appropriate metadata to govern
-The business and technical value of a data catalog
-Building the catalog into people’s routines
-Positioning the data catalog for success
-Questions the data catalog can answer
Because every organization produces and propagates data as part of its day-to-day operations, data trends are becoming more and more important in the mainstream business world's consciousness. For many organizations in various industries, though, comprehension of this development begins and ends with buzzwords: "Big Data," "NoSQL," "Data Scientist," and so on. Few realize that all solutions to their business problems, regardless of platform or relevant technology, rely to a critical extent on the data model supporting them. As such, data modeling is not an optional task for an organization's data effort, but rather a vital activity that facilitates the solutions driving your business. Since quality engineering and architecture work products do not happen accidentally, the more your organization depends on automation, the more important the data models driving your engineering and architecture activities become. This webinar illustrates data modeling as a key activity upon which so much technology and business investment depends.
Specific learning objectives include:
- Understanding what types of challenges require data modeling to be part of the solution
- How automation requires standardization that is derivable via data modeling techniques
- Why only a working partnership between data and the business can produce useful outcomes
Analytics play a critical role in supporting strategic business initiatives. Despite the obvious value that analytics professionals deliver by providing the analytics for these initiatives, many executives question the economic return of analytics as well as data lakes, machine learning, master data management, and the like.
Technology professionals need to calculate and present business value in terms business executives can understand. Unfortunately, most IT professionals lack the knowledge required to develop comprehensive cost-benefit analyses and return on investment (ROI) measurements.
This session provides a framework to help technology professionals research, measure, and present the economic value of a proposed or existing analytics initiative, no matter what form the business benefit takes. The session will provide practical advice about ROI formulas, how to calculate ROI, and how to collect the necessary information.
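As a toy illustration of the arithmetic such a framework rests on (the figures and function names below are invented for this sketch, not taken from the session), basic ROI and payback calculations look like this:

```python
# Hedged sketch: generic ROI and payback arithmetic for an analytics
# initiative. All numbers here are hypothetical.

def roi(total_benefit: float, total_cost: float) -> float:
    """Return on investment as a fraction: (benefit - cost) / cost."""
    return (total_benefit - total_cost) / total_cost

def payback_period_years(annual_net_benefit: float, upfront_cost: float) -> float:
    """Years until cumulative net benefit covers the upfront cost."""
    return upfront_cost / annual_net_benefit

# Hypothetical initiative: $500k total cost, $800k measured benefit,
# $300k of net benefit per year.
print(f"ROI: {roi(800_000, 500_000):.0%}")                             # ROI: 60%
print(f"Payback: {payback_period_years(300_000, 500_000):.1f} years")  # Payback: 1.7 years
```

In practice the hard part is not the formula but the inputs: quantifying benefits in terms executives accept, which is exactly what the session's framework addresses.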
How a Semantic Layer Makes Data Mesh Work at Scale (DATAVERSITY)
Data Mesh is a trending approach to building a decentralized data architecture by leveraging a domain-oriented, self-service design. However, the pure definition of Data Mesh lacks a center of excellence or central data team and doesn’t address the need for a common approach for sharing data products across teams. The semantic layer is emerging as a key component to supporting a Hub and Spoke style of organizing data teams by introducing data model sharing, collaboration, and distributed ownership controls.
This session will explain how data teams can define common models and definitions with a semantic layer to decentralize analytics product creation using a Hub and Spoke architecture.
Attend this session to learn about:
- The role of a Data Mesh in the modern cloud architecture.
- How a semantic layer can serve as the binding agent to support decentralization.
- How to drive self service with consistency and control.
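To make the hub-and-spoke idea concrete, here is a deliberately simplified sketch (not any vendor's actual semantic-layer API): the hub publishes shared metric definitions, and spoke teams may extend them but not silently redefine them.

```python
# Toy model of hub-and-spoke semantic-layer governance. Metric names and
# expressions are invented for illustration.
hub_metrics = {
    "revenue": "SUM(order_total)",
    "active_users": "COUNT(DISTINCT user_id)",
}

def extend(hub: dict, spoke: dict) -> dict:
    """Merge spoke metrics into the shared hub model, rejecting redefinitions."""
    conflicts = hub.keys() & spoke.keys()
    if conflicts:
        raise ValueError(f"spoke may not redefine shared metrics: {sorted(conflicts)}")
    return {**hub, **spoke}

# A marketing 'spoke' adds its own domain metric on top of the shared model.
marketing_model = extend(hub_metrics, {"cost_per_click": "SUM(spend) / SUM(clicks)"})
print(sorted(marketing_model))  # ['active_users', 'cost_per_click', 'revenue']
```

The design point is distributed ownership with central consistency: spokes self-serve on their own metrics, while shared definitions stay under hub control.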
Enterprise data literacy. A worthy objective? Certainly! A realistic goal? That remains to be seen. As companies consider investing in data literacy education, questions arise about its value and purpose. While the destination – having a data-fluent workforce – is attractive, we wonder how (and if) we can get there.
Kicking off this webinar series, we begin with a panel discussion to explore the landscape of literacy, including expert positions and results from focus groups:
- why it matters,
- what it means,
- what gets in the way,
- who needs it (and how much they need),
- what companies believe it will accomplish.
In this engaging discussion about literacy, we will set the stage for future webinars to answer specific questions and feature successful literacy efforts.
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re... (DATAVERSITY)
Change is hard, especially in response to negative stimuli, real or perceived. So organizations need to reframe how they think about data privacy, security, and governance, treating them as value centers that 1) ensure enterprise data can flow where it needs to, 2) prevent internal and external threats rather than merely react to them, and 3) comply with data privacy and security regulations.
Working together, these roles can accelerate faster access to approved, relevant and higher quality data – and that means more successful use cases, faster speed to insights, and better business outcomes. However, both new information and tools are required to make the shift from defense to offense, reducing data drama while increasing its value.
Join us for this panel discussion with experts in these fields as they discuss:
- Recent research about where data privacy, security and governance stand
- The most valuable enterprise data use cases
- The common obstacles to data value creation
- New approaches to data privacy, security and governance
- Their advice on how to shift from a reactive to resilient mindset/culture/organization
You’ll be educated, entertained and inspired by this panel and their expertise in using the data trifecta to innovate more often, operate more efficiently, and differentiate more strategically.
Emerging Trends in Data Architecture – What’s the Next Big Thing? (DATAVERSITY)
With technological innovation and change occurring at an ever-increasing rate, it’s hard to keep track of what’s hype and what can provide practical value for your organization. Join this webinar to see the results of a recent DATAVERSITY survey on emerging trends in Data Architecture, along with practical commentary and advice from industry expert Donna Burbank.
Data Governance Trends – A Look Backwards and Forwards (DATAVERSITY)
As DATAVERSITY’s RWDG series hurtles into our 12th year, this webinar takes a quick look behind us, evaluates the present, and predicts the future of Data Governance. Based on webinar numbers, hot Data Governance topics have evolved over the years from policies and best practices, roles and tools, and data catalogs and frameworks, to supporting data mesh and fabric, artificial intelligence, virtualization, literacy, and metadata governance.
Join Bob Seiner as he reflects on the past and what has and has not worked, while sharing examples of enterprise successes and struggles. In this webinar, Bob will challenge the audience to stay a step ahead by learning from the past and blazing a new trail into the future of Data Governance.
In this webinar, Bob will focus on:
- Data Governance’s past, present, and future
- How trials and tribulations evolve to success
- Leveraging lessons learned to improve productivity
- The great Data Governance tool explosion
- The future of Data Governance
Data Governance Trends and Best Practices To Implement Today (DATAVERSITY)
Would you share your bank account information on social media? How about shouting your social security number on the New York City subway? We didn’t think so either – that’s why data governance is consistently top of mind.
In this webinar, we’ll discuss the common Cloud data governance best practices – and how to apply them today. Join us to uncover Google Cloud’s investment in data governance and learn practical and doable methods around key management and confidential computing. Hear real customer experiences and leave with insights that you can share with your team. Let’s get solving.
Topics that you will hear addressed in this webinar:
- Understanding the basics of Cloud Incident Response (IR) and anticipated data governance trends
- Best practices for key management and applying data governance to your day-to-day
- The next wave of Confidential Computing and how to get started, including a demo
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the enterprise mission will be executed and company leadership will emerge. The data professional is absolutely sitting on the performance of the company in this information economy and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and data architecture. William will kick off the fifth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
Too often I hear the question “Can you help me with our data strategy?” Unfortunately, for most, this is the wrong request because it focuses on the least valuable component: the data strategy itself. A more useful request is: “Can you help me apply data strategically?” Yes, at early maturity phases the process of developing strategic thinking about data is more important than the actual product! Trying to write a good (much less perfect) data strategy on the first attempt is generally not productive – particularly given the widespread acceptance of Mike Tyson’s truism: “Everybody has a plan until they get punched in the face.” This program refocuses efforts on learning how to iteratively improve the way data is strategically applied. This will permit data-based strategy components to keep up with agile, evolving organizational strategies. It also contributes to three primary organizational data goals. Learn how to improve the following:
- Your organization’s data
- The way your people use data
- The way your people use data to achieve your organizational strategy
This will help in ways never imagined. Data are your sole non-depletable, non-degradable, durable strategic assets, and they are pervasively shared across every organizational area. Addressing existing challenges programmatically includes overcoming necessary but insufficient prerequisites and developing a disciplined, repeatable means of improving business objectives. This process (based on the theory of constraints) is where the strategic data work really occurs as organizations identify prioritized areas where better assets, literacy, and support (data strategy components) can help an organization better achieve specific strategic objectives. Then the process becomes lather, rinse, and repeat. Several complementary concepts are also covered, including:
- A cohesive argument for why data strategy is necessary for effective data governance
- An overview of prerequisites for effective strategic use of data strategy, as well as common pitfalls
- A repeatable process for identifying and removing data constraints
- The importance of balancing business operation and innovation
Who Should Own Data Governance – IT or Business? (DATAVERSITY)
The question is asked all the time: “What part of the organization should own your Data Governance program?” The typical answers are “the business” and “IT (information technology).” Another answer to that question is “Yes.” The program must be owned and reside somewhere in the organization. You may ask yourself if there is a correct answer to the question.
Join this new RWDG webinar with Bob Seiner where Bob will answer the question that is the title of this webinar. Determining ownership of Data Governance is a vital first step. Figuring out the appropriate part of the organization to manage the program is an important second step. This webinar will help you address these questions and more.
In this session Bob will share:
- What is meant by “the business” when it comes to owning Data Governance
- Why some people say that Data Governance in IT is destined to fail
- Examples of IT positioned Data Governance success
- Considerations for answering the question in your organization
- The final answer to the question of who should own Data Governance
It is clear that Data Management best practices exist, and so does a useful process for improving existing Data Management practices. The question arises: since we understand the goal, how does one design a process for Data Management goal achievement? This program describes what must be done at the programmatic level to achieve better data use and a way to implement this as part of your data program. The approach combines DMBoK content and CMMI/DMM processes, providing organizations the opportunity to benefit from the best of both. It also permits organizations to understand:
- Their current Data Management practices
- Strengths that should be leveraged
- Remediation opportunities
MLOps – Applying DevOps to Competitive Advantage (DATAVERSITY)
MLOps is a practice for collaboration between data science and operations to manage the production machine learning (ML) lifecycle. As an amalgamation of “machine learning” and “operations,” MLOps applies DevOps principles to ML delivery, enabling the delivery of ML-based innovation at scale, resulting in:
- Faster time to market of ML-based solutions
- More rapid rate of experimentation, driving innovation
- Assurance of quality, trustworthiness, and ethical AI
MLOps is essential for scaling ML. Without it, enterprises risk struggling with costly overhead and stalled progress. Several vendors have emerged with offerings to support MLOps: the major offerings are Microsoft Azure ML and Google Vertex AI. We looked at these offerings from the perspective of enterprise features and time-to-value.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 4. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Observability Concepts EVERY Developer Should Know – DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Epistemic Interaction – tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes real work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. We’ll share practical tips and strategies for successful relationship building that leads to closing the deal.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Unstructured Data and the Enterprise
Second Quarter 2012
Various permutations of text (word processing files, simple text files, emails, etc.) make up the largest amount of unstructured data currently in the enterprise. Many firms are in the process of implementing unstructured data management projects to find useful information from the immensity of corporate email.
Content Management Systems exist partially to help an enterprise manage and derive information from the data contained in unstructured text documents. Most of these systems leverage metadata to provide an extra layer of classification, allowing for easier searches and enhanced reporting.
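The metadata layer described here can be sketched with a toy example (the documents, fields, and helper function below are invented for illustration, not drawn from any particular CMS):

```python
# Minimal illustration of metadata-assisted search over unstructured text.
# A keyword matches on full text; metadata filters then refine the results.
documents = [
    {"text": "Q2 revenue projections and regional forecasts",
     "meta": {"type": "report", "department": "finance"}},
    {"text": "Onboarding checklist for new hires",
     "meta": {"type": "document", "department": "hr"}},
]

def search(docs, keyword=None, **meta_filters):
    """Full-text match first, then narrow by structured metadata fields."""
    hits = [d for d in docs if keyword is None or keyword in d["text"].lower()]
    for field, value in meta_filters.items():
        hits = [d for d in hits if d["meta"].get(field) == value]
    return hits

print(len(search(documents, keyword="revenue", department="finance")))  # 1
```

Even this crude extra layer of classification shows why metadata allows for easier searches and enhanced reporting: the filters operate on structured fields rather than on raw text alone.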
Unstructured Data and the Enterprise
by Paul Williams
In collaboration with Christine Connors
dataversity.net 1
Executive Summary
In its most basic definition, unstructured data simply means any form of data that does not easily fit into a relational model or a set of database tables. Unstructured data exists in a variety of formats: books, audio, video, or even a collection of documents. In fact, some of this data may very well contain a measure of structure, such as chapters within a novel or the markup on an HTML Web page, but not the full data model typical of relational databases.
➺ Anywhere from 40 to 80 percent of an enterprise’s stored data currently resides in an unstructured format. While most unstructured data is in various text formats, other formats include audio, video, Web pages, and office software data.
➺ Firms able to successfully capture and manage their unstructured data hold a competitive advantage over firms unable to do the same. Over 90 percent of enterprises are currently planning to manage unstructured data or are already doing it.
➺ Metadata markup, text analytics, data mining, and taxonomy creation are all industry-proven techniques used to capture an enterprise’s unstructured data. Included case studies show these techniques in action as part of successful problem-solving projects.
➺ Difficulty in integration with enterprise systems, along with the immaturity of currently available unstructured data management tools, are two major barriers to the successful management of unstructured data.
Many industry pundits claim that 80 to 85 percent of all enterprise data resides in an unstructured format. Some studies counter that statistic, arguing the true percentage is significantly less. DATAVERSITY’s own 2012 survey reflects a percentage below the commonly stated 80 percent. No matter the actual percentage compared to structured formats, there is little doubt the amount of unstructured data continues to grow.
Gartner Research, one organization quoting the 80 percent metric for unstructured data, predicts a nearly 800 percent growth in the amount of enterprise data over the next 5 years. The large majority of this data is expected to be unstructured. This data growth is one factor driving corporate investments in Big Data, Cloud Computing, and NoSQL.
The Growing Industry around Unstructured Data

A large industry has grown around the task of deriving valuable business information out of unstructured data. With parsers scraping information out of pages and pages of text, in addition to full systems built around data taxonomy and discovery, many options exist for any enterprise trying to make sense of their unstructured information.

Commercially available Content Management Systems (CMS) use metadata to provide more accurate searching of an organization’s document library. In fact, metadata remains a very important tool in the handling of any unstructured data. Business Intelligence system providers also offer mature solutions around making sense of the vast arrays of an enterprise’s seemingly unrelated information.

Despite any associated costs, enterprises better able to leverage meaningful Business Intelligence from unstructured data gain a competitive advantage over those companies who cannot.

Once that data is successfully under management, finding the best techniques and applications for leveraging the data remains key in deriving a competitive advantage for any organization.

[Graph #1: "What industry are you in?" Respondent industries include technology/computers/software, government, IT consulting services, education, software/system development, healthcare, telecommunications, financial services, insurance, manufacturing, retail/wholesale/distribution, entertainment, media, marketing, and publishing.]
Current Usage of Unstructured Data in the Enterprise

DATAVERSITY recently sent out a survey to its readership base covering a host of topics related to unstructured data in the enterprise. With nearly 400 respondents, the answers provide a reliable cross section of industry types and organizational sizes.

The survey results provide a unique insight as to how organizations perceive their unstructured data problem, what steps they are taking to derive meaningful business information from that data, what formats the data resides in, and finally some demographic information about their job function, organization, and industry.

Referencing the organizational demographics of the survey respondents provides a sense of the types and sizes of the companies attempting to manage unstructured data. In short, a wide cross section of organizational sizes and industries are currently managing or considering implementing projects to manage unstructured data. See Graphs One and Two below:

[Graph #2: "Number of employees in your company?" Company sizes range from less than 10 to over 50,000 employees.]
Documents are the unstructured data type most commonly under management or considered for management, followed closely by emails, presentations, and scanned documents. Various media formats (images, audio, and video) and social media chatter are also important. See Graph Three below:

[Graph #3: "What types of unstructured data does your organization currently manage, or is considering for management?" Categories include documents, email, scanned documents or images, presentations, social media chatter, drawings/graphics, video, audio, photographs, geospatial images, none, and other.]
Most organizations hope to improve business efficiencies and reduce costs through the successful management of unstructured data. Additional potential benefits include improved communication, new sales leads, compliance with regulatory requirements, stronger data governance, better enterprise search, the ability to parse and analyze social media channels, and integration with Business Intelligence systems.
The main barriers to improving unstructured data management in the enterprise reflect the relative immaturity of the practice. Two major barriers to the successful management of unstructured data are the difficulty of integration with enterprise systems and the immaturity of currently available unstructured data management tools. Cost, lack of executive support, and a lack of unstructured data knowledge among IT staff are also major barriers. See Graph Four below.
Graph #4: What are the main barriers to effectively improving the management of unstructured data within your company? (Response options: difficulty of integrating unstructured data with other enterprise systems; lack of mature tools/systems for enterprise management; requirements are not well defined; shortage of skills/knowledge amongst existing IT staff; complexity of solutions; high cost of available technologies and solutions; too much to do, too little time; lack of executive or business-level support; business case is not understood, or not sufficiently compelling; lack of automated tools; lack of governance support in tools/systems.)
System integration with existing structured data appears to be the key technology challenge of the unstructured data management process across most industries (see Graph Five below). Determining project requirements is another key challenge, especially in the insurance and entertainment industries, as is improving semantic consistency between multiple systems. Other relevant challenges include security, taxonomy development, scalability, and search result integration across multiple platforms.
Graph #5 (table): What are the key technology challenges you have encountered when attempting to build systems or applications for managing unstructured data? Industries surveyed: Computers/Software; Education; Entertainment; Financial Services; Government; Healthcare; Insurance; IT Consulting Services; Manufacturing; Marketing; Media; Publishing; Retail/Wholesale/Distribution; Software/System Development; Technology/Telecommunications; Other. Each row below lists that challenge's sixteen per-industry response percentages.

Determining project requirements: 55.9% 47.1% 32.1% 26.7% 33.3% 55.6% 16.7% 33.3% 37.5% 25.0% 50.0% 75.0% 100.0% 35.7% 25.0% 47.4%
Methods for integrating with existing structured data: 58.8% 64.7% 56.6% 26.7% 41.7% 50.0% 33.3% 73.3% 62.5% 100.0% 100.0% 75.0% 50.0% 39.3% 20.0% 47.7%
Natural language understanding: 11.8% 29.4% 17.0% 13.3% 20.8% 27.8% 33.3% 13.3% 25.0% 0.0% 25.0% 25.0% 50.0% 32.1% 40.0% 18.4%
Entity and/or concept extraction and analysis: 29.4% 23.5% 35.8% 13.3% 20.8% 22.2% 33.3% 0.0% 31.3% 0.0% 50.0% 25.0% 50.0% 25.0% 30.0% 26.3%
Filtering/summarizing relevant information: 23.5% 23.5% 35.8% 40.0% 33.3% 27.8% 33.3% 60.0% 25.0% 50.0% 25.0% 0.0% 50.0% 21.4% 40.0% 34.2%
Semantic consistency across multiple systems: 47.1% 23.5% 41.5% 33.3% 37.5% 44.4% 0.0% 26.7% 37.5% 0.0% 75.0% 75.0% 50.0% 42.9% 40.0% 36.8%
High storage requirements: 17.6% 17.6% 28.3% 26.7% 25.0% 33.3% 66.7% 26.7% 25.0% 25.0% 25.0% 25.0% 50.0% 10.7% 35.0% 28.9%
Poor system performance at large scale: 11.8% 23.5% 26.4% 20.0% 16.7% 22.2% 16.7% 20.0% 25.0% 0.0% 25.0% 0.0% 100.0% 21.4% 20.0% 15.8%
System modeling and design: 17.6% 29.4% 18.9% 13.3% 20.8% 33.3% 16.7% 40.0% 37.5% 25.0% 75.0% 25.0% 50.0% 14.3% 30.0% 21.1%
Digital rights management: 20.6% 5.9% 9.4% 6.7% 12.5% 5.6% 0.0% 6.7% 0.0% 0.0% 25.0% 0.0% 50.0% 7.1% 5.0% 5.3%
Taxonomy development: 35.3% 23.5% 24.5% 0.0% 25.0% 11.1% 0.0% 40.0% 37.5% 0.0% 50.0% 50.0% 50.0% 25.0% 25.0% 23.7%
Integrating search results across multiple platforms: 23.5% 23.5% 26.4% 20.0% 16.7% 27.8% 0.0% 26.7% 56.3% 25.0% 75.0% 0.0% 50.0% 21.4% 25.0% 47.4%
Security: 32.4% 35.3% 17.0% 53.3% 37.5% 16.7% 16.7% 26.7% 31.3% 0.0% 25.0% 50.0% 50.0% 28.6% 15.0% 28.9%
Information quality problems: 32.4% 17.6% 18.9% 40.0% 20.8% 44.4% 0.0% 20.0% 31.3% 25.0% 0.0% 25.0% 50.0% 25.0% 45.0% 42.1%
Unstructured Data Formats
Text Documents
Various permutations of text (word processing files, simple text files, emails etc.) make up the largest amount of unstructured data cur-
rently in the enterprise. Many firms are in the process of implementing unstructured data management projects to find useful informa-
tion from the immensity of corporate email.
Content Management Systems exist partially to help an enterprise manage and derive information from the data contained in unstruc-
tured text documents. Most of these systems leverage metadata to provide an extra layer of classification allowing for easier searches
and enhanced reporting.
Web Pages
Web pages are unique in the world of unstructured data. In fact, an argument can be made that HTML markup itself provides a mea-
sure of structure. Web sites that are primarily data-driven might even use a fully normalized database as their back end. In addition to
pure HTML markup, many of these Web-based applications are written in Java, PHP, .NET and other development frameworks that
render HTML output.
The proprietary algorithms utilized by search engine providers use HTML meta tags in addition to in-text keyword analysis to help
tailor search results for the user. Of course, some of that same logic also works in generating Internet advertising for Web surfers, with
the ad revenue leading to financial success for the advertising providers.
The rapid growth of social media and the interactive Web is also creating large amounts of unstructured data as well as opportunities
for those companies able to manage and derive value from that data. The rising popularity of graph databases optimized for finding
relationships between social media users and their consumer preferences (along with others in their social networks) is arguably due to
companies hoping to leverage revenue from unstructured data analysis.
Media Formats (Audio, Video, Images)
Audio, video, and image files are all forms of unstructured data. Intelligent real-time analysis of audio data is commonplace in digital
audio recording and processing, and also used to a lesser extent with the other two formats.
Metadata is a must with media files; it provides necessary additional classification. A trip around the iTunes music store provides an
excellent opportunity to see this metadata in action, as band names, genres, and related artists drive Apple’s Genius and other music
recommendation services like the Pandora Internet radio station.
Office Software Data Formats (PowerPoint, MS Project)
Files created using Microsoft Office or any other office suite run the gamut of data formats. Microsoft Access creates and manages
fully structured database files in its own format. Excel and PowerPoint files both provide challenges to organizations looking to
include information from these formats in their corporate reporting mechanisms. Microsoft’s VBA language provides a measure of
functionality in parsing meaningful information from Office files.
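VBA aside, the newer Office Open XML formats (.docx, .xlsx, .pptx) are ZIP archives of XML parts, so text can often be pulled out with nothing more than a standard library. Below is a minimal sketch in Python: it builds a stand-in document in memory (the sample content is invented) and extracts the text from its main XML part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(path_or_file):
    """Pull plain text out of a .docx by reading word/document.xml inside the ZIP."""
    with zipfile.ZipFile(path_or_file) as z:
        xml = z.read("word/document.xml")
    # Text lives in w:t elements inside runs and paragraphs.
    return " ".join(t.text for t in ET.fromstring(xml).iter(W + "t") if t.text)

# Build a minimal stand-in document in memory for demonstration.
doc_xml = (f'<w:document xmlns:w="{W[1:-1]}"><w:body>'
           '<w:p><w:r><w:t>Quarterly results:</w:t></w:r>'
           '<w:r><w:t>up 12%</w:t></w:r></w:p></w:body></w:document>')
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", doc_xml)

print(docx_text(buf))  # Quarterly results: up 12%
```

A real .docx carries many more parts (styles, relationships, media), but the text-bearing part follows this same structure.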
Commercial applications with proprietary data formats include various customer management (CRM) tools, larger enterprise resource
planning (ERP) applications like SAP, or even architectural drafting applications like AutoCAD. In some cases, this software lever-
ages a semi-structured data format such as XML for data exchange between applications.
Capturing and Managing Unstructured Data
Before any enterprise can derive meaningful information from the mass of unstructured data, the process to capture that data needs
consideration. Data scraping or text parsing involves the extraction of information from data at its most basic level, but other methods
and tools provide a more measured approach.
Data Scraping and Text Parsing
Data Scraping is a technique where human-readable information is extracted from a computer system by another program. It is impor-
tant to distinguish the “human readable” aspect of data scraping from the typical data exchange between computers which can involve
structured formats.
The technique first came into vogue with screen scraping, used during the advent of client-server computing, as many organizations
struggled with the volume of data residing in legacy mainframe applications. The data from these mainframe apps, parsed from termi-
nal screens, was usually imported into some form of reporting or Business Intelligence application. Data Scraping can also be neces-
sary when attempting to interface to any legacy system without an application programming interface (API).
In recent times, Web scraping has followed a similar technique as screen scraping, allowing meaningful data from Web pages to be
parsed, scrubbed, and stored in a relational database. There are currently many applications using some form of Web scraping, espe-
cially in the areas of Web mashup and Web site integration, although in many cases APIs provide a more structured process.
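As a minimal sketch of the parsing step, the following uses only Python's standard-library HTML parser to pull the visible text out of a page, skipping script and style blocks. The sample markup is invented; a production scraper would add fetching, error handling, and scrubbing before loading a database.

```python
from html.parser import HTMLParser

class TextScraper(HTMLParser):
    """Collects visible text from an HTML page, skipping script/style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def scrape_text(html):
    parser = TextScraper()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Policy 123</h1><p>Annuity holder: J. Smith</p></body></html>")
print(scrape_text(page))  # Policy 123 Annuity holder: J. Smith
```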
Report Mining is similar to data and Web scraping in that it involves deriving meaningful information from a collection of static,
human-readable reports. This technique can also be useful in an enterprise’s software development QA process, such as facilitating
regression test result analysis.
Data Scraping and Text Parsing Case Study
A Fledgling Financial Services Company Leverages
Technology; Plays with the Big Boys
In the mid-1990s, equipped with investment dollars, ARM Financial Group went on a buying binge, acquiring a collection of small
insurance companies that specialized in retirement products, such as annuities and structured settlements. Its short-term quest focused
on rapidly increasing the dollar value of its assets under management and then profiting from improved efficiencies around the admin-
istration of these assets.
Getting accounting systems under control is normally the first step when acquiring any company in the financial sector. This is usually
followed by bringing the policy administration systems online with the new company. ARM faced a dilemma concerning the older
mainframe systems from a newly acquired firm.
The company made the choice to go client-server for their annuity processing systems; they were one of the first firms in the insurance
sector to do so. While massive projects grew around converting policy records from the old mainframe system to the new client-serv-
er based system, a more elegant solution was quickly developed to handle the customer service role.
Screen scraping was used to grab policy data from mainframe terminals and store the data in a simple relational database, providing
customer service agents with a means to access policy data when on the phone with policyholders. This screen scraping project was
online months before all the policy data was converted and stored in the new client-server system's database.
While the policy data in the older mainframe system was in a structured format, the use of screen scraping allowed the rapid develop-
ment of a needed solution for the customer service function at the new company.
As ARM continued to grow, improving operational efficiency became vital in maximizing profits. Despite the company’s state-of-the-
art, Web-based system for new annuities, nearly all business was in the form of paper -- both for new policies and changes to existing
policies.
Implementing an imaging and workflow system was crucial in re-engineering processes for controlling all aspects of an annuity’s
policy lifecycle. With most imaging systems, automated text parsing is vital in grabbing information from a document and storing
it in some form of database. The workflow elements of the new imaging system at the financial services company were also highly
dependent on the manual scraping of important metadata from essentially unstructured paper policy documents.
This metadata was essential when routing a policy through the processing workflows at the company. The imaging and workflow sys-
tem also used a SQL Server database to persist important data captured from the policy. While some of the policy data was the same
as what was stored in the annuity administration system, having it also stored in the workflow system with a SQL Server back end
allowed the rapid development of applications in Visual Basic to support customer service and policy management functions.
These new systems proved to be an early form of Business Intelligence application made possible by the leveraging of both manual
and automated screen scraping and text parsing to derive useful information from a mass of unstructured paper documents.
In the late 1990s the concepts around what is now called unstructured data were still in their infancy. The technological successes of a
small financial services company in pioneering insurance processing allowed it to compete with much larger firms. These successes
had a lot to do with recognizing what concepts like screen scraping and text parsing could do to make annuity processing more ef-
ficient by allowing the rapid development of new systems to support both customer service and management roles.
Metadata
Sometimes defined as “data about data,” metadata is often a useful tool when dealing with unstructured data residing in text docu-
ments. In some cases, metadata serves a similar role as a data dictionary by describing the content of data, while other times it is used
in more of a markup or tagging purpose.
While some metadata is embedded directly in the document it describes, metadata can also be stored in a database or some other re-
pository. Many enterprises build a formal metadata registry with limited write access. This registry can be shared internally or exter-
nally through a Web service interface.
Many metadata standards specific to various industries have developed over time. One example of this is the Dublin Core Metadata
Initiative (DCMI), used extensively in library sciences and other disciplines for the purposes of online resource discovery.
Metadata schema syntax can be in a variety of text or markup formats, including HTML, XML, or even plain text. Both ANSI and ISO
are active in developing and enforcing standards for expressing metadata syntax.
A type of metadata, meta tags are used in HTML to mark a Web page with words describing the content of the page. In the past, search
engines relied mostly on meta tags when building result sets based on a search query, although their relevance has waned in recent
times.
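A minimal sketch of reading such tags, again with Python's standard library; the page and its Dublin Core-style meta tags are invented for illustration.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                # Normalize the metadata key; keep the value verbatim.
                self.meta[d["name"].lower()] = d["content"]

page = """<html><head>
<meta name="DC.title" content="Annuity Processing Guide">
<meta name="DC.subject" content="structured settlements">
<meta name="keywords" content="annuity, insurance, workflow">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(page)
print(reader.meta["dc.title"])  # Annuity Processing Guide
```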
METADATA CASE STUDY
Leveraging Metadata to Improve Unstructured
Document Searching
The Education Department of a large Midwestern state faced a problem: the state’s teachers needed a system to assist them in organiz-
ing appropriate instructional materials for the classroom. The materials primarily resided as documents in the Education Department’s
Documentum system. The system had a small amount of metadata in it, but there was no functional tool that allowed the easy search-
ing and collecting of that content.
The acquisition of a Google Search Appliance device promised to provide some measure of search functionality, but Google's interface is predicated on a single text box for full-text search, without the ability to narrow or filter those search results. The teachers needed the ability to search by criteria such as grade level, subject area, or the type of content (lesson plans, content standards, etc.).
The solution developed for the state relied on a multi-tier approach, first building a more robust Web search interface that added categorized collections of check boxes for narrowing search results, in addition to Google's standard text box for full-text searching.
A school backpack metaphor was used as the “shopping cart” for this Web-based application. As a teacher searched through the
instructional materials, anything they wanted to use in the classroom was simply saved in their backpack for later retrieval. Different
backpacks could be saved for each teacher depending on their needs.
This front-end search interface and persistent backpack model would not function without improving the metadata stored on each
document in Documentum. A side project involved retrofitting each instructional material document with the relevant metadata for
grade level, subject, and content type.
Luckily, Google’s Search Appliance API allowed additional data to be passed into the search request. Testing revealed the combination
of full-text search with additional filtering provided by the Web interface and enhanced metadata returned the relevant instructional
materials in the search result set.
While Microsoft’s IIS and Active Server Pages were the preferred Web development solution at the Education Department, they were
a few years behind in fully implementing the .NET framework. Because of that, the project became an interesting hybrid of Classic
ASP at the front end with .NET code used to provide the backpack storage functionality, along with the Web service interface for the
Google Search Appliance.
Considering the additional search criteria, it was determined the standard Web page output of the Google Search Appliance was not
sufficient to serve the needs of the application. The appliance’s search result set was available in XML format, so back-end code was
written to enhance the output with graphics, a detailing of the search criteria, and a link to add the specific instructional material to
the teacher’s backpack. This additional output combined nicely with the standard document summary normally provided by Google’s
search.
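The post-processing step described above can be sketched as follows; the XML element names and the backpack URL are hypothetical stand-ins for illustration, since the appliance's actual result schema differs.

```python
import xml.etree.ElementTree as ET

# Hypothetical result-set shape; a real appliance's XML schema is different.
results_xml = """<results>
  <result>
    <url>https://example.edu/lessons/fractions.pdf</url>
    <title>Introducing Fractions</title>
    <grade>4</grade>
  </result>
  <result>
    <url>https://example.edu/standards/math-grade4.pdf</url>
    <title>Grade 4 Math Standards</title>
    <grade>4</grade>
  </result>
</results>"""

def enhance_results(xml_text, backpack_base="/backpack/add?doc="):
    """Turn raw XML search results into display rows with an add-to-backpack link."""
    rows = []
    for r in ET.fromstring(xml_text).iter("result"):
        url = r.findtext("url")
        rows.append({
            "title": r.findtext("title"),
            "grade": r.findtext("grade"),
            "backpack_link": backpack_base + url,
        })
    return rows

rows = enhance_results(results_xml)
print(rows[0]["backpack_link"])
```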
Even though a variety of pieces went into creating the best solution for teachers, leveraging improved metadata supplied the project’s
ultimate success. Straight out of the box, the Google Search Appliance did not provide the necessary search result filtering needed to
return the correct instructional material from a mass of unstructured documents. By enhancing the metadata stored in Documentum,
Google’s search functionality greatly improved, and the rest of the project was able to proceed.
Taxonomy
While the term taxonomy is traditionally related to the world of biological species classification, it also plays a similar role in classify-
ing terms for any number of subjects. Information Management taxonomies play a vital role in making sense out of unstructured data,
primarily as a method for organizing metadata.
Taxonomies in IT usually take one of two forms. The first form draws from the species classification origin of the term taxonomy, fol-
lowing a hierarchical tree structure model. Individual terms in this kind of taxonomy have “parents” at the higher levels and “chil-
dren” at corresponding lower levels.
The second taxonomy form is essentially a controlled vocabulary of the terms surrounding any subject matter or system. This might
take the form of a simple glossary or thesaurus, or something more complex and resource intensive, like the creation of a fully-formed
ontology. This second type of taxonomy tends to be more common in the world of information technology.
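The hierarchical form can be modeled with a simple parent/child structure; the terms below are invented for illustration.

```python
class TaxonomyNode:
    """A term in a hierarchical taxonomy; each node knows its parent."""
    def __init__(self, term, parent=None):
        self.term = term
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Return the chain of terms from the root down to this node."""
        node, terms = self, []
        while node is not None:
            terms.append(node.term)
            node = node.parent
        return list(reversed(terms))

# A small content taxonomy in the species-classification style.
root = TaxonomyNode("Instructional Materials")
math = TaxonomyNode("Mathematics", root)
fractions = TaxonomyNode("Fractions", math)

print(" > ".join(fractions.path()))  # Instructional Materials > Mathematics > Fractions
```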
The ANSI/NISO Z39.19 standard exists for the authoring of taxonomies, information thesauruses, and other organized data dictionar-
ies, illustrating the growing maturity of this data management discipline. Over the last decade, new companies and software packages
centered on taxonomy management have come and gone, with a few earning acclaim as industry leaders, sometimes with taxonomies
dealing with a specific subject matter.
Ultimately, taxonomies, controlled vocabularies, and their similar brethren serve an enterprise by organizing metadata in a fashion that
helps in finding valuable business information out of unstructured data.
taxonomy CASE STUDY #1
Improving Corporate Intranet Search through the
Development and Deployment of Taxonomies
In the middle of the last decade, a large publicly held electronics company had a problem with unstructured information, estimated
to be about 85 percent of all corporate data. In addition, over 90 percent of the corporate data contained no tagging. Issues with the
duplication of content and determining the true age of documents also hampered the company’s associates when searching for infor-
mation.
To get a clearer picture of the scope of the problem, this company surveyed its employees on their Intranet search habits. The respons-
es demonstrated that the employees wanted a better search interface with more categorization and sorting options, along with a more
streamlined search result set. Some respondents felt frustrated with the difficulty in finding corporate documents through the current
search interface.
A corporate project team was formed with the responsibility of improving Intranet search. The core of the project was the development of various taxonomies and controlled vocabularies, combined with metadata, to improve the categorization of internal unstructured information; the work was meant to benefit both the search interface and the search results.
Five taxonomies were developed and deployed in the first year of the project. Some were purchased externally and modified to fit the
data at this corporation, while the others were fully developed from within. In all cases, certain employees were tasked as Subject Mat-
ter Experts in their relevant areas for the purposes of creating the most suitable vocabularies and metadata for the taxonomies.
The second year of the project saw the deployment of three additional taxonomies covering the areas of Human Resource, Six Sigma,
and Legal. Two taxonomies were purchased externally and modified to suit, while the other was developed internally. Additionally,
some improvement of the original categorization was done based on feedback after the previous year’s deployments.
The benefits of the taxonomy development were obvious; it significantly improved all aspects of Intranet search. General employees and electronic engineers wasted less time weeding through bad search results, improving productivity and the company's bottom line. With the reduction of duplicated data, storage costs fell, even when considering new data growth.
Post project surveys and metrics revealed increased use of the improved search and the use of the categories. The general employee
opinions on search improved. Ultimately, the internal development team won an award for their work on the project.
taxonomy CASE STUDY #2
Folksonomies:
A Hybrid Approach to Taxonomy Development
The creation of taxonomies in any organization comes with associated costs, especially in the case of formal taxonomies where exter-
nal consultants and Subject Matter Experts are involved with the process. In some cases, informal taxonomies exhibit smaller costs,
but with more risk considering the relative lack of taxonomy creation experience compared with a more formalized process.
A hybrid approach utilizes experts in taxonomy creation combined with a user-centric focus on the knowledge modeling for the
project. It attempts to lessen the costs associated with a formal taxonomy, while still providing an organized development process. The
term “folksonomy” is used to describe this more user-centric approach.
Folksonomies leverage crowd-sourced wisdom on the relevant subject matter at hand, an approach not too different from social book-
marking Web sites like Digg or Reddit, or even a blog community focused around a specific subject.
In these kinds of hybrid taxonomy projects, users are able to submit Web sites and/or tags they would like to see included in the over-
all vocabulary. An expert taxonomy team reviews these submissions for appropriateness and uniqueness. Finally, the submissions end
up persisted in XML format for inclusion in the enterprise search process.
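A minimal sketch of that review-and-persist step, with invented submissions and an assumed approval set standing in for the expert team's decisions.

```python
import xml.etree.ElementTree as ET

def persist_approved_tags(submissions, approved_by_team):
    """Write reviewer-approved, de-duplicated tag submissions to an XML vocabulary."""
    root = ET.Element("vocabulary")
    seen = set()
    for user, tag in submissions:
        tag_norm = tag.strip().lower()
        if tag_norm in seen or tag_norm not in approved_by_team:
            continue  # drop duplicates and rejected submissions
        seen.add(tag_norm)
        el = ET.SubElement(root, "tag", submitted_by=user)
        el.text = tag_norm
    return ET.tostring(root, encoding="unicode")

submissions = [("alice", "Six Sigma"), ("bob", "six sigma"), ("carol", "lunch menu")]
approved = {"six sigma"}
print(persist_approved_tags(submissions, approved))
```

The duplicate "six sigma" and the rejected "lunch menu" are filtered out, leaving one approved tag in the persisted vocabulary.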
The sharing of these internal social bookmarks and tags enhances the quality of enterprise search and also provides insight to the
perception of an organization’s internally facing content. Users feel they are an important stakeholder in the process, thus improving
company morale.
Implementing a folksonomy normally involves installing some form of tagging tool, usually available as a module for an open-source
CMS like Drupal or WordPress. A quality reporting system is also important in determining the efficacy of the tagging, in addition to
providing metrics on how the internal content is being used.
In many cases, a hybrid approach to taxonomy development hits a proverbial sweet spot in combining ROI with lower costs when
compared to a full-fledged formal taxonomy. For smaller organizations, with a limited budget, it is an approach worthy of consider-
ation.
eDiscovery (Electronic Discovery)
eDiscovery providers bring an automated search focus to leveraging valuable information from an organization’s unstructured data.
They are similar to discovery systems used primarily in the library science and research industries. A separate section covering library
discovery systems follows this current section.
eDiscovery is widely used in the legal industry as a means for finding any evidence or information potentially useful in a case. In fact,
court-sanctioned hacking of computer systems is a valid form of eDiscovery. Computer forensics, normally used to recover deleted evidence from a hard drive or other computer storage, is also related to eDiscovery.
Electronic discovery systems are able to search a variety of unstructured data formats, including text, media (images, audio, and
video), spreadsheets, email, and even entire Web sites. The best eDiscovery applications search everything from in-house server farms
to the full breadth of the Internet in their quest to find valuable evidentiary information. Investigators peruse the discovered informa-
tion in a variety of formats that run from printed paper to computer-based browsing.
When potential litigation is a concern, the protection of corporate emails, instant messaging, and even metadata attached to unstruc-
tured documents becomes crucial for any enterprise. This electronically stored information (ESI) became a focal point in changes to
Federal Law in 2006 and 2007 that required organizations to retain, protect, and manage this kind of data.
Yet, a 2010 study showed that while 52 percent of organizations have an ESI policy, only 38 percent have tested the policy, and 45 percent aren't even aware whether any testing occurred. Considering the risk of future litigation, firms need to focus on the complete development, including testing, of a robust ESI management policy.
As the eDiscovery industry matures, larger companies are bringing the discovery function in house as opposed to relying on external
vendor-provided systems; other firms prefer a mix between internal and out-sourced solutions. Whatever their ultimate choice, firms
need to realize the vital importance of well-defined and documented procedures for ESI archives and eDiscovery platforms.
Discovery Systems
The library sciences industry also depends on electronic discovery systems to provide research and other functionality to their con-
sumers. An array of vendors and platforms has grown around this form of discovery, residing generally separate from the eDiscovery
sector in the legal industry. Recently, enterprise-based knowledge management initiatives have embraced the traditionally library-
based discovery process.
While discovery systems for library sciences have previously focused on search products for traditional content (e.g. books and peri-
odicals), in recent times their scope has broadened to include a wider range of material, including video, audio, and subscriptions to
electronic resources. Additionally, content from external providers is now able to be discovered in the search.
These discovery systems depend on the indexing of both document metadata along with the full text to provide a robust set of search
results for the user. Content providers benefit from enhanced exposure as well as being able to control the display and delivery of the
discovered materials. Obviously, cooperation between those creating the content and those creating the discovery system is paramount
to ensure the proper indexing of metadata.
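A toy sketch of indexing full text and metadata together, with invented documents; real discovery systems use far richer analyzers and ranking.

```python
from collections import defaultdict

def build_index(documents):
    """Inverted index over both full text and metadata values, as discovery systems do."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        terms = doc["text"].lower().split()
        terms += [v.lower() for v in doc["metadata"].values()]
        for term in terms:
            index[term].add(doc_id)
    return index

documents = {
    "d1": {"text": "annual report on annuity sales", "metadata": {"format": "periodical"}},
    "d2": {"text": "intro to structured settlements", "metadata": {"format": "video"}},
}
index = build_index(documents)
print(sorted(index["annuity"]))  # ['d1']
print(sorted(index["video"]))    # ['d2']
```

Note that "video" matches d2 only through its metadata, not its text; indexing both fields is what lets a single query reach content the full text alone would miss.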
With competing discovery system providers, those also producing content sometimes choose to not index their documents with a
competitor. This occurred when EBSCO removed its content from the Ex Libris Primo Central discovery system just before introduc-
ing its own EBSCO Discovery Service. Other pundits worry organizations that both produce content and a discovery service will skew
any search results to favor their own content. Consumers, including libraries, need to take into account these issues before subscribing
to any discovery system.
Text Analytics
Text analytics is another related discipline useful in deriving meaningful information from unstructured data. The term became widely
used around the year 2000 as a more formalized business-based outgrowth of text mining, a technique in use since the 1980s.
In addition to raw text mining, text analytics uses other more formalized techniques, including natural language processing, to turn un-
structured text into data more suitable for Business Intelligence and other analytical uses. Considering that a majority of unstructured
data resides in a textual format, text analytics remains one of the most important techniques for making sense of unstructured data.
Professor Marti Hearst of the University of California, Berkeley, in her 1999 paper Untangling Text Data Mining, presciently described the
current practice of text analytics in today’s business climate:
“For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to
produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collec-
tions to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent
text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.”
Text analytics makes up the basis of some of the previously mentioned methods used in capturing unstructured data, most notably
discovery systems and eDiscovery. It is also employed by organizations to monitor social media for human resources applications or
personally targeted advertising.
Machine learning and semantic processing are two other sub-disciplines of text analytics at the forefront of innovation. IBM’s Watson
computer, famous for defeating Jeopardy! all-time champion Ken Jennings, is an apt example of the power of semantics at the higher
end of computer science.
In addition to natural language and semantic processing, a typical text analytics application might include other techniques or process-
es, such as named entity recognition, which is useful in finding common place names, stock ticker symbols, and abbreviations. Dis-
ambiguation methods can be applied to provide context when faced with different entities sharing the same name: Apple, the Beatles
record company, compared to Apple, the consumer electronics giant.
Regular expression matching is a straightforward process used to parse phone numbers, email addresses, and Web addresses from unstructured text. By contrast, sentiment analysis is a more difficult technique, attempting to derive subjective information or human opinion from text. Machine learning combined with sophisticated syntactic analysis normally forms the basis of automated sentiment analysis.
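The regular-expression side of this is simple enough to sketch directly. The patterns below are deliberately simplified assumptions (common US phone formats and basic email/URL shapes), not exhaustive validators:

```python
import re

# A minimal sketch of regular-expression matching over unstructured
# text. The patterns are simplified: they cover common US phone
# numbers and basic email/URL shapes, not every valid form.

PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE   = re.compile(r"https?://[^\s]+")

def extract_contacts(text):
    """Pull phone numbers, email addresses, and URLs from raw text."""
    return {
        "phones": PHONE_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
    }

memo = ("Call (650) 433-1700 or email info@dataversity.net; "
        "details at http://www.dataversity.net/")
print(extract_contacts(memo))
```

Because the structure being extracted (digits, an @ sign, a scheme prefix) is fixed and local, this works reliably; sentiment has no such surface pattern, which is why it needs the statistical machinery described above.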
Text analytics remains a wide-ranging discipline used in both the business and scientific worlds. From leading-edge applications like IBM’s Watson to day-to-day tasks like parsing email addresses from text, it is an important part of capturing unstructured data.
dataversity.net 12
Integrating Unstructured Data and Putting it to Use in your Real World
Christine Connors, Principal, TriviumRLG LLC
Christine Connors has extensive experience in taxonomy, ontology and metadata design and development. Prior to forming TriviumRLG, Ms. Connors was the global director, semantic technology solutions for Dow Jones, responsible for partnering with business champions across Dow Jones to improve digital asset management and delivery. In that position, she managed a worldwide team responsible for the development of taxonomies, ontologies and metadata that are used to add value to Dow Jones news and financial information products. Ms. Connors also served as business champion for the Synaptica® software application, including managing a US-based team of software developers, and supported Dow Jones consulting practices worldwide, which deliver end-to-end information access solutions based on taxonomies, metadata and semantic technologies. Prior to joining Dow Jones, Ms. Connors was a knowledge architect at Intuit, where she was responsible for introducing semantic technologies to online content management and search. Before that, she was a Metadata Architect at Raytheon Company and Cybrarian at CEOExpress Company. At Raytheon she oversaw knowledge representation and enterprise search, delivering large-scale taxonomies, metadata schema and rules-based classification to improve retrieval of petabytes of internal information via a multi-vendor retrieval platform.

Various permutations of text (word processing files, simple text files, emails, etc.) make up the largest amount of unstructured data currently in the enterprise. Many firms are in the process of implementing unstructured data management projects to find useful information in the immensity of corporate email.

Content Management Systems exist partially to help an enterprise manage and derive information from the data contained in unstructured text documents. Most of these systems leverage metadata to provide an extra layer of classification, allowing for easier searches and enhanced reporting.

Now that you are capturing and storing your data from unstructured sources, what can you do with it? Where can you put it to good use? What new categories of applications are best suited to exploit it?

Asset Valuation
Your organization has created terabytes of intellectual property - what is its value? Value is a difficult thing to assess. A piece of information stored “just in case” today may or may not become the critical missing piece of the puzzle down the road. That is assuming, of course, that you can both find and use that information.

Applying structure to your data will enable two critical processes:
1) De-duplication and ‘weeding’ of bad or unnecessary data
2) Visualizing what remains

Once that weeding has occurred, the costs of data storage can be estimated based upon hardware and maintenance expenses. These costs are then weighed against the value of the digital assets. Value is determined by the alignment of the type and nature of the data against the organization’s core goals. Is the data used as an input to products, or to measure the profits of the products or services being delivered? How many degrees of separation are there? Does the data identify opportunities or threats in the marketplace? What are the predicted profits and potential losses?

Expertise Location
Consider all of the resumes your HR department has collected; they are a wealth of unstructured data regarding the talents of your employees. Using text analytics, a profile can be built of each person. The application of structure to such a collection of existing data means that project heads can easily identify potential team members who have the experience needed to tackle a particular challenge. Add to that the HR-approved information collected during performance reviews, and the team leader has a better handle on what strategies to employ in managing the team’s efforts.
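A toy sketch of that expertise-location idea: match resume text against a controlled skill vocabulary to build a profile per person, then query profiles for a project's required skills. The skill list, names, and resume snippets here are all invented for illustration.

```python
# Toy expertise location: build a skill profile per person from
# resume text using a controlled vocabulary, then find candidates
# who cover a project's required skills. All data here is invented.

SKILLS = {"python", "taxonomy", "metadata", "sharepoint", "search"}

def build_profile(resume_text):
    """Return the set of known skills mentioned in a resume."""
    words = {w.strip(".,;").lower() for w in resume_text.split()}
    return SKILLS & words

def find_candidates(profiles, required_skills):
    """Return names whose profiles cover every required skill."""
    return sorted(name for name, skills in profiles.items()
                  if required_skills <= skills)

profiles = {
    "Ada": build_profile("Led metadata and taxonomy design; Python tooling."),
    "Ben": build_profile("SharePoint administrator with search experience."),
}
print(find_candidates(profiles, {"taxonomy", "metadata"}))  # ['Ada']
```

In practice the "controlled vocabulary" would be the organization's skills taxonomy, and performance-review data would be folded into the same profile structure.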
Business Intelligence
It’s always good to know how external forces can impact your organization. Perhaps your best customer has joined the board of a non-profit organization -- a board on which an executive at your top competitor is already a member. How will that new network connection affect your relationship with your customer’s organization?
Or perhaps one of your top engineers has been asked by her alma mater to work with a lead professor, his students, and another
alumnus on a project that could benefit the school. Could you lose this top performer to a new venture? Is the research worth investing
in? Would you like to set up an alert, using a discovery system, to keep track of the internal memos and external press regarding the
project?
New Product Development
Your researchers and editors have crafted fabulous publications. System analysis reveals that your subscribers are only using bits and
pieces of this published work, reading a page or two, reading only the abstract, or going right to charts and visualizations. How can
you break out these sub-sections of content with minimal overhead? How can you start quickly by using existing content?
By identifying entities in your content, you can re-use or create new graphs, charts, and even user-filtered data visualizations. The entities are identified by analyzing the existing content, then extracted or tagged, and finally indexed for re-use.
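The extract-tag-index step can be sketched as a tiny inverted index: tag each document with the entities it mentions, then map each entity back to the documents that mention it. The entity list and document texts below are invented examples.

```python
from collections import defaultdict

# A minimal sketch of extract-tag-index for content re-use: tag
# documents with known entities, then build an inverted index so
# sub-sections of content can be found by entity. Data is invented.

ENTITIES = {"Teradata", "Attensity", "Hadoop"}

def tag_entities(doc_text):
    """Return the known entities appearing in a document."""
    return {e for e in ENTITIES if e in doc_text}

def build_index(docs):
    """Map each entity to the ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for entity in tag_entities(text):
            index[entity].add(doc_id)
    return dict(index)

docs = {
    "d1": "Teradata added Attensity text analytics to its suite.",
    "d2": "DataStax Enterprise runs Cassandra alongside Hadoop.",
}
print(build_index(docs))
```

With such an index in place, publishing a sub-set of content "about Teradata" across publications becomes a lookup rather than an editorial project.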
Doing this work allows you to publish sub-sets of data within a single publication or across publications. You can use wholly owned content, licensed content, or a combination of the two, as contractually permitted. You can integrate your data into multimedia tools or social networking sites.
Requirements for Unstructured Data Projects
By Christine Connors, Principal, TriviumRLG LLC
As with any undertaking, requirements are needed for an unstructured data project. It isn’t about simply exposing the contents of the documents. It is about making that content useful to the systems and people who need to use it. Or, as many experts have said in other contexts: making the right content available to the right people at the right time.
There may be documents exposed that you didn’t know were there, that shouldn’t be publicly available, and that are accessible because of an error somewhere in the applications. Imagine the trouble if your new system indexed an HR spreadsheet containing salaries, addresses, and social security numbers that had been backed up onto a shared drive the user thought was secure.
Consider the content collections that will be part of the program. Do you anticipate any of it having restrictions? If so, then what are
those restrictions? How will authorized users authenticate and gain access? Will you restrict access by entity type? By rules-based
classification? By system access and control policies? These are important things to consider.
Given that you might find documents you weren’t expecting, how will you architect the back end to scale effectively? Will it be easily
repeated on additional clusters? What OS and software will it need to run? Will it fail over? Can it scale to handle the number of users,
documents, and entities predicted for the anticipated life of the hardware?
Once you’ve determined that, how will users interact? What will the front end need to provide? Typically users manage Create - Read - Update - Delete (CRUD) rights as permissioned within a system. They also search, browse, publish, integrate, migrate, and import to and from other systems. What tools are needed to support these actions? Should select users be able to perform administrative tasks via a client or browser interface? How about the ability to generate reports? What operating systems does this interface need to support?
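At its core, the permissioning question above reduces to a mapping from roles to allowed CRUD actions. A minimal sketch, with role names invented for illustration:

```python
# A minimal sketch of CRUD permissioning for a content system.
# The roles and their rights are invented for illustration; a real
# system would also layer in per-collection access rules.

PERMISSIONS = {
    "viewer": {"read"},
    "editor": {"create", "read", "update"},
    "admin":  {"create", "read", "update", "delete"},
}

def can(role, action):
    """True if the role is permitted to perform the CRUD action."""
    return action in PERMISSIONS.get(role, set())

print(can("editor", "update"))  # True
print(can("viewer", "delete"))  # False
```

Answering the restriction questions earlier in this section amounts to deciding what the keys and values of such a table are, and at what granularity (entity type, collection, rules-based classification) the lookup happens.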
Once you’ve got your content under control, how are you going to package and publish it? What other applications need to use the
data? What are the interoperability requirements?
The ongoing identification of your unstructured data is critical if you want to avoid undertaking such a project again. One method is Metadata Management. What requirements do you have there? What kinds of information remain important to manage, in addition to the metadata elements? Will you need a taxonomy? Do you need an external tool, or is there a module within your current CMS, DMS or portal solution that will suffice?
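One concrete form Metadata Management can take is a required element set that every record must satisfy before entering the repository. A minimal sketch, using Dublin Core-style element names purely as an example; the required set is an assumption, not a prescription:

```python
# A minimal sketch of metadata management: validate that each record
# carries a required core of elements before it enters the repository.
# The element names are Dublin Core-style examples, not a standard
# mandated set.

REQUIRED = {"title", "creator", "date", "subject"}

def validate(record):
    """Return the required metadata elements missing from a record."""
    return REQUIRED - record.keys()

good = {"title": "Q2 Report", "creator": "C. Connors",
        "date": "2012-04-01", "subject": "unstructured data"}
bad = {"title": "Untitled"}
print(validate(good))  # set()
print(validate(bad))   # {'creator', 'date', 'subject'} in some order
```

Whether this check lives in an external tool or a module of the existing CMS is exactly the build-versus-buy question posed above.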
There are many questions here, but most of the overall process is not too different from anything led by a competent project manager. These tasks can be completed in parallel or serially, in combination with usability tests, surveys, and focus groups. Taxonomy development, if needed, will benefit from the guidance of an expert, as it is not typically a linear process like that of software development.
Summing Up the Opportunities Created by Unstructured Data
The existence of vast quantities of unstructured data at any organization is not necessarily a problem. In fact, it should be considered an opportunity for success. The various case studies contained within showed that projects focused on deriving information from seemingly unrelated data often allowed firms with a proactive attitude to gain a competitive advantage over firms that fear or ignore such unstructured data.
This paper provided background on the various types of unstructured data, along with a collection of time-honored techniques for capturing and managing that data. The real-world case studies provided inspiration for solving potential unstructured data issues. The list of applications and organizations with tools related to unstructured data is an excellent starting point for researching the wide range of issues in this sector of data management.
Additionally, the included survey data revealed that most firms are taking an active approach with unstructured data, so no one should
feel alone when considering their own data issues. It is a fascinating area of expertise that continues to change and evolve; there are
many opportunities for organizational success, personal success, and knowledge growth, with so many transformations occurring all
the time.
Appendix: Open Source and Commercial Applications around Unstructured Data
Teradata: Teradata produces enterprise Business Intelligence and Data Warehousing software. Their suite also provides functional-
ity that facilitates data extraction from various unstructured and structured sources into a proprietary relational database. The company
recently added text analytics (from Attensity) to its software suite to better analyze certain types of unstructured data, including text
documents and spreadsheets.
Teradata Corporation
10000 Innovation Drive
Dayton, OH 45342
Phone: (866) 548-8348
Attensity: Attensity specializes in the extraction of meaning from unstructured data through the use of text analytics. They recently
partnered with Data Warehousing provider, Teradata, to add text analysis to the latter’s software suite.
Attensity Group
2465 East Bayshore Road
Suite 300
Palo Alto, CA 94303
Phone: (650) 433-1700
DataStax: DataStax is known as a leader in commercial implementations of the Apache Cassandra database. Last year’s introduction of DataStax Enterprise combined Cassandra with Hadoop-based analytics and the company’s OpsCenter management tool.
DataStax HQ - SF Bay Area
777 Mariners Island Blvd #510
San Mateo, CA 94404
Phone: (650) 389-6000
Taxonomy Providers
Smartlogic: Smartlogic is an enterprise taxonomy provider primarily known for Semaphore, a content intelligence platform prom-
ising improved control of and easier access to an organization’s unstructured data.
Smartlogic US
560 S. Winchester Blvd, Suite 500
San Jose, California, 95128
Phone: (408) 213-9500
Fax: (408) 572-5601
Synaptica: Synaptica is an innovation leader in the areas of enterprise taxonomy and metadata. Their platform integrates with Mi-
crosoft SharePoint, providing a method to store Synaptica’s taxonomy within SharePoint, facilitating unstructured document search.
Synaptica, LLC
11384 Pine Valley Drive
Franktown, CO 80116
Phone: (303) 298-1947
Teragram: Recently acquired by SAS, Teragram is an expert in the world of linguistic search. They help organizations better manage unstructured content in a variety of languages, allowing an enterprise to further develop its international presence.
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513-2414
Phone: (919) 677-8000
Content Management Systems
Documentum: Documentum remains one of the largest content management system (CMS) platforms in the industry. The software facilitates the management of business documents as well as a host of other unstructured data types, including images, audio, and video. Documentum is now owned by storage and information management giant EMC Corporation.
EMC Corporation
176 South Street
Hopkinton, MA 01748
Phone: (866) 438-3622
MarkLogic: MarkLogic makes enterprise software to help organizations manage unstructured data. Their system is based on the use
of XQuery for fast retrieval of documents marked up with metadata in XML format, and thus scales nicely when accessing Big Data
stores.
MarkLogic Corporation Headquarters
999 Skyway Road, Suite 200
San Carlos, CA 94070
Phone: (877) 992-8885
WordPress: WordPress is one of the most popular open source blogging and content management platforms. A robust community
has grown around the platform which leverages the MySQL and PHP open source solutions for database and scripting functionality.
Drupal: Drupal is another open source content management platform, but without the self-blogging focus of WordPress.
Discovery Systems
Verity K2: K2 is an enterprise search platform, or discovery system, used by organizations wanting to provide intelligent searching of
the mass of corporate unstructured data. Verity was acquired by Autonomy, which in turn was recently acquired by HP.
Autonomy US Headquarters
One Market Plaza
Spear Tower, Suite 1900
San Francisco, CA 94105
Phone: (415) 243 9955
Serials Solutions’ Summon Service: The Summon service from Serials Solutions is a Web-scale discovery system used primarily by libraries. Summon provides search functionality across a full range of media, including audio, video, and e-content, in addition to books.
Serials Solutions North America
501 North 34th Street
Suite 300
Seattle, WA 98103-8645
Phone: (866) SERIALS (737-4257)
EBSCO Discovery Service: EBSCO’s Discovery Service facilitates discovery of an institution’s resources by combining pre-indexed metadata from both internal and external sources to create a uniquely tailored search solution known for its speed. Although EBSCO is best known for its database and e-book services, the Discovery Service primarily supports research institutions and libraries.
EBSCO Publishing
10 Estes Street
Ipswich, MA 01938
Phone: (800) 653-2726 (USA Canada)
Fax: (978) 356-6565
OCLC WorldCat Local: WorldCat Local is a library-based discovery system provided by the Online Computer Library Center
(OCLC). The system provides single search box access to over 922 million items from library collections worldwide. OCLC is also
the organization responsible for first developing the Dublin Core Metadata Initiative.
OCLC Headquarters
6565 Kilgour Place
Dublin, OH 43017-3395
Phone: (614) 764-6000
Toll Free: (800) 848-5878 (USA and Canada only)
Fax: (614) 764-6096
Ex Libris Primo Central: The Primo Central Index is the centerpiece of Ex Libris’ Primo discovery and delivery system, focused on providing the research scholar audience with access to hundreds of millions of documents. Ex Libris is a leading provider of automation solutions for the library sciences market.
Ex Libris
1350 E Touhy Avenue, Suite 200 E
Des Plaines, IL 60018
Phone: (847) 296-2200
Fax: (847) 296-5636
Toll Free: (800) 762-6300
Metadata
Dublin Core Metadata Initiative: The Dublin Core Metadata Initiative maintains a metadata element set primarily used by libraries and educational institutions worldwide. It was originally developed by the Online Computer Library Center (OCLC), located in Dublin, OH.
Esri ArcCatalog: Esri’s ArcCatalog is a tool within their ArcGIS software suite used for the development and management of
GIS-related metadata. Esri is a worldwide leader in the management of geographic data.
Esri Headquarters
380 New York Street
Redlands, CA 92373-8100
Phone: (909) 793-2853
SchemaLogic MetaPoint: MetaPoint is a metadata tagging and management tool developed by SchemaLogic. Its primary audience is companies with large investments in Microsoft’s document tools, Office and SharePoint. MetaPoint promises to provide the missing connection between the two products. SchemaLogic was recently acquired by taxonomy systems provider Smartlogic.
Smartlogic US
560 S. Winchester Blvd, Suite 500
San Jose, California, 95128
Phone: (408) 213-9500
Fax: (408) 572-5601
about dataversity
We provide a centralized location for training, online webinars, certification, news and more for information technology (IT) professionals, executives and business managers worldwide. Members enjoy access to a deeper archive, leaders within the industry, a knowledge base, and discounts on many educational resources, including webinars and data management conferences. For questions, feedback, ideas on future topics, or for more information please visit: http://www.dataversity.net/, or email: info@dataversity.net.