The document discusses analyzing unstructured data and extracting structured information from it. It provides examples of data along a continuum from completely unstructured (like images) to highly structured (like data in an Excel spreadsheet). It emphasizes that adding structure, such as defining attributes with simple data types, makes the data more useful for tasks like analysis and visualization. Finding the right structure to impose on unstructured data involves understanding what information you need from the data.
Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on github: http://bit.ly/pawdata
DataTags: Sharing Privacy Sensitive Data, by Latanya Sweeney
The DataTags framework makes it easy for data producers to deposit, data publishers to store and distribute, and data users to access and use datasets containing confidential information, in a standardized and responsible way. The talk will first introduce the concepts and tools behind DataTags, and then focus on the user-facing component of the system, the Tagging Server (available today at datatags.org). We will conclude by describing how future versions of Dataverse will use DataTags to automatically handle sensitive datasets that can only be shared under certain restrictions.
Talk at a Data Journalism BootCamp organised by ICFJ, World Bank Group and African Media Initiative in New Delhi to a group of 60 journalists, coders and social sector folks. Other amazing sessions included those from Govind Ethiraj of IndiaSpend, Andrew from BBC, Parul from Google, Nasr from HacksHacker, Thej from DataMeet and David from Code for Africa. http://delhi.dbootcamp.org/
Imagine that you have to integrate and search data from 200 different sources, each of which uses a different structure. Your data may be incomplete, the same information may be represented in different ways by different sources, and it is often vague.
Natural Language Search with Knowledge Graphs (Haystack 2019), by Trey Grainger
To optimally interpret most natural language queries, it is necessary to understand the phrases, entities, commands, and relationships represented or implied within the search. Knowledge graphs serve as useful instantiations of ontologies which can help represent this kind of knowledge within a domain.
In this talk, we'll walk through techniques to build knowledge graphs automatically from your own domain-specific content, how you can update and edit the nodes and relationships, and how you can seamlessly integrate them into your search solution for enhanced query interpretation and semantic search. We'll have some fun with some of the more search-centric use cases of knowledge graphs, such as entity extraction, query expansion, disambiguation, and pattern identification within our queries: for example, transforming the query "bbq near haystack" into
{
  "filter": ["doc_type:restaurant"],
  "query": {
    "boost": {
      "b": "recip(geodist(38.034780,-78.486790),1,1000,1000)",
      "query": "bbq OR barbeque OR barbecue"
    }
  }
}
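The query-rewriting step described above can be sketched in a few lines of Python. This is a toy illustration, not the talk's actual implementation: the gazetteers (`ENTITY_TYPES`, `SYNONYMS`, `PLACES`) are hard-coded stand-ins for lookups a real system would make against a knowledge graph or geocoder.

```python
import re

# Hypothetical gazetteers; a real system would back these with a knowledge graph.
ENTITY_TYPES = {"bbq": "restaurant"}            # assumed entity -> doc_type mapping
SYNONYMS = {"bbq": ["bbq", "barbeque", "barbecue"]}
PLACES = {"haystack": (38.034780, -78.486790)}  # assumed name -> lat/lon lookup

def interpret(query: str) -> dict:
    """Turn a query like 'bbq near haystack' into a structured search request."""
    m = re.match(r"(\w+)\s+near\s+(\w+)", query.lower())
    if not m:
        return {"query": query}  # fall back to the raw keyword query
    term, place = m.groups()
    lat, lon = PLACES[place]
    return {
        "filter": [f"doc_type:{ENTITY_TYPES[term]}"],
        "query": {
            "boost": {
                # boost documents near the resolved location
                "b": f"recip(geodist({lat:.6f},{lon:.6f}),1,1000,1000)",
                # expand the entity into its known synonyms
                "query": " OR ".join(SYNONYMS[term]),
            }
        },
    }
```

Calling `interpret("bbq near haystack")` produces a structured request of the shape shown above: the entity becomes a filter plus synonym expansion, and the place name becomes a geo-distance boost.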
We'll also specifically cover use of the Semantic Knowledge Graph, a particularly interesting knowledge graph implementation available within Apache Solr that can be auto-generated from your own domain-specific content and which provides highly-nuanced, contextual interpretation of all of the terms, phrases and entities within your domain. We'll see a live demo with real world data demonstrating how you can build and apply your own knowledge graphs to power much more relevant query understanding within your search engine.
With the advent of Facebook’s Open Graph, HTML5 and Google’s Rich Snippets, the web has begun a rapid transformation to being understandable for computers. This understanding comes from data that is embedded in webpages and, perhaps more importantly, a new kind of hyperlink that connects concepts instead of documents. Information architects and interaction designers are needed now more than ever to make sense of all this data and to visualize it in new and interesting ways. In this presentation, you will learn how to take advantage of the Semantic Web’s foundational technology called Linked Data, which allows you to both produce and consume the data that is making up this new web.
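Producing Linked Data can be as simple as emitting JSON-LD, the structured markup behind Rich Snippets, and embedding it in a page. A minimal sketch follows; the schema.org vocabulary and DBpedia link are real, but the specific values are illustrative, not taken from the talk.

```python
import json

# Describe a *concept* (a place) rather than a document. The "sameAs" URL is
# the concept-level hyperlink the paragraph above describes: it links this
# record to the same entity in another dataset.
place = {
    "@context": "https://schema.org",
    "@type": "Place",
    "name": "Boston",
    "sameAs": "http://dbpedia.org/resource/Boston",  # link to a concept, not a page
}
jsonld = json.dumps(place, indent=2)
print(jsonld)
```

Wrapping this JSON in a `<script type="application/ld+json">` tag inside a web page is what lets crawlers consume it as data.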
Software engineering is inherently a collaborative venture, involving many stakeholders who coordinate their efforts to produce large software systems. While the importance of human aspects in software engineering was recognised as early as the 1970s, the emergence of open source software (late 1990s) and platforms such as Stack Overflow and GitHub (late 2000s) enabled the application of empirical methods to the study of human aspects of software engineering.
In the first part of the talk we present a selection of recent results pertaining to two main questions: who are software developers, and in what kinds of activities do they engage? The second part of the talk focuses on the tools and techniques that have been used to obtain the aforementioned results.
Construction of Authority Information for Personal Names Focused on the Forme...
Variant personal names exist for the same Japanese historical individuals, and when handling historical data it is desirable to control these. Furthermore, by grasping the position of the family an individual belongs to within a genealogy or organization, it is possible to estimate the individual's social position and the power he might command. At present there is no database providing such information in Japan, and there is a need to construct authority information for personal names structured in a standardized data description language. On this basis, the present study describes a project to construct authority information for the former Japanese noble families, which played a central role in the modernization of Japan, and for persons related to them, using topic maps.
Securing your Kubernetes cluster: a step-by-step guide to success!, by Katia Himeur
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
A tale of scale & speed: How the US Navy is enabling software delivery from l...
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GridMate - End to end testing is a critical piece to ensure quality and avoid...
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The Art of the Pitch: WordPress Relationships and Sales, by Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that lead to closing the deal.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024, by Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 6, by Diana Gray
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024, by Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Communications Mining Series - Zero to Hero - Session 1, by Diana Gray
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability while sacrificing security. This best-practices guide outlines steps users can take to better protect personal devices and information.
5. 1056. Plaintiffs-Intervenors, Robert and Tasha Lambert, are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint, which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the class and subclasses as set forth in the schedules accompanying this complaint, which are incorporated herein by reference. 1058. Plaintiffs-Intervenors, Daniel and Nicole Smith, are citizens of Alabama and together own real property located at 766 Tabernacle Road, Monroeville, Alabama
http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
6. The same complaint excerpt, repeated.
http://www.propublica.org/documents/item/drywall-plaintiffs-omnibus-class-action-complaint
34. Continuum diagram: Images → Text Blob (unstructured → structured). The text blob is the complaint excerpt: "1056. Plaintiffs-Intervenors, Robert and Tasha Lambert, are citizens of Alabama and together own real property located at 541 Lynn Hurst Court, Montgomery, Alabama 36117. Plaintiffs are participating as class representatives in the class and subclasses as set forth in the schedules accompanying this complaint, which are incorporated herein by reference. 1057. Plaintiff-Intervenor, Brenda Owens, is a citizen of Alabama and owns real property located at 2105 Lane Avenue, Birmingham, Alabama 35217. Plaintiff is participating as a class representative in the..."
35. Continuum diagram: Images → Text Blob → Email (unstructured → structured).
36. Continuum diagram: Images → Text Blob → Email, with an email example:
Subject: Re: IRE conference in Boston
Date: June 1, 3:08 PM
From: jaimi@ire.org
37. Continuum diagram: Images → Text Blob → Email → Excel (unstructured → structured).
38. Continuum repeated: Images → Text Blob → Email → Excel.
39. Continuum, with a tweet example: "It's sunny in texas".
40. The tweet mapped to a table: Tweet = "It's sunny in texas", Weather = Sunny, Location = Texas.
41. The same table with the location resolved to coordinates: Tweet = "It's sunny in texas", Weather = Sunny, Location = (37.06, -95.67).
42. When: You have unstructured data
Ask: What structure do I need?
Find: Attributes with simple types
43. What Am I Talking About?
• Structured Data 101
• Structured data continuum
• More Examples
44. 2011 State of the Union
http://www.boston.com/news/politics/specials/obama_state_of_the_union_word_cloud/
46. Mr. Speaker, Mr. Vice President,
members of Congress,
distinguished guests, and fellow
Americans:
Tonight I want to begin by
congratulating the men and
women of the 112th Congress, as
well as your new Speaker, John
Boehner. And as we mark this
occasion, we're also mindful of
the empty chair in this chamber,
and we pray for the health of our
colleague -- and our friend --
Gabby Giffords.
It's no secret that those of us here
tonight have had our differences
over the last two years. The
debates have been contentious;
we have fought fiercely for our
beliefs. And that's a good thing.
47. Mr. Speaker, Mr. Vice President, members of Congress, distinguished guests, and fellow Americans: ... (same speech text as slide 46, shown beside an extracted word column)
Word: Mr, Speaker, Vice, President, Members, Congress, Distinguished, Guests, Americans, People, Jobs, New, years
71. Structure = Super Valuable
When: You have unstructured data
Ask: What structure do I need?
Find: Attributes with simple types
72. Structure = Super Valuable (same summary)
tinyurl.com/iredatatipsheet
eugenewu@mit.edu
@sirrice
Editor's Notes
Hi, I'm Eugene Wu. I was asked to talk about unstructured data, and after some thought, I figured I'll..
Actually talk about structured data. In particular, I want you to walk away with three things: what structured data is and why you should care; how to think about structured data in contrast to unstructured data, specifically that data isn't just structured or unstructured; and finally a bunch of quick stories and visualizations of how authors went from unstructured to structured data. Let me start with an example before talking about what structure means.
Jeff Larson and Joaquin Sapien of ProPublica and Aaron Kessler of the Sarasota Herald-Tribune did a really nice data journalism piece on the impact of tainted drywall on home owners. A lot of homes were built using drywall from China that emitted foul odors, frequently caused mysterious electronics failures, and caused health problems in residents. They produced a really nice visualization of the counties affected by the tainted drywall, where darker blue means more tainted homes. Let's walk through how they went from unstructured data to this visualization.
They started with court documents from class action lawsuits and tax forms
And extracted the plain text. For example, this is a partial list of plaintiffs; there were about 2,000 in this document.
And they manually extracted the state and address information from the text.
They then geocoded the addresses to get latitude longitude information,
and finally the county that each house belongs in. Doing this process for nearly 7,000 addresses
reveals the number of tainted homes in each of the 150 counties. This table is imported into a visualization tool to construct…
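The tally behind that table can be sketched in a few lines. The addresses are taken from the complaint excerpt shown earlier, but the county labels are illustrative stand-ins for real geocoder output, not values from the actual analysis:

```python
from collections import Counter

# Each record pairs an address with the county a geocoding step returned
# for it. The last address and all county labels are invented for
# illustration; a real pipeline would geocode every address in the filing.
records = [
    ("541 Lynn Hurst Court, Montgomery, AL 36117", "Montgomery"),
    ("2105 Lane Avenue, Birmingham, AL 35217", "Jefferson"),
    ("766 Tabernacle Road, Monroeville, AL", "Monroe"),
    ("1000 Example Ave, Birmingham, AL", "Jefferson"),
]

# Tainted homes per county: the structured table the choropleth is built from.
counts = Counter(county for _addr, county in records)
```

The same `Counter` pattern scales from four rows to the 7,000 addresses mentioned above.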
The map that is shown on the ProPublica page. That was a fairly large number of steps.
If we take a quick look at their process, we can grossly simplify it down to the following steps: take text from the documents, specifically address information, and plot it on Google Maps.
And stepping back, to bring this into the context of this talk: they start with unstructured information, extract specific structured data, and visualize it.
What we’ll talk about in this talk is how to go from unstructured information to structured data.
But the first thing to do is to describe…
why the heck we care and what structured data is.
Who cares? Structured data makes your life easier in a number of ways. There's lots of software, like databases and pandas, to help you store and analyze structured data.
In a similar vein, practically all visualization tools expect your data in some kind of structured format.
It can easily take a long time to extract structured data from your documents. But now that you've got structured data about tainted homes in each county, it becomes easier to create mashups with other data. In contrast, there are not a lot of tools that work with unstructured data.
The canonical example of structured data is a table like this, which I'm sure you've seen either on the web in the wild, or on sites like Google Fusion Tables. What makes structured data... structured?
For practical purposes, think of structured data as a bunch of attributes. For example, each of these three columns is an attribute. Each attribute has a name and a data type.
Why are names important? Let's say you want to create that ProPublica map of each county.
If I just stored the data in a text file like the one on the right, Google Maps has no idea what it's trying to plot. I can't point a map at that text.
What I can say is "create a map and use county." Since the attribute has a name, the map can easily get the county names.
The data type embodies the "meaning" of the attribute. It says "what does this attribute represent?" The more specific you can be, the better.
If the data type is a number, then we can sort it, or take the sum or average. If we know it's a type of number (date/time), then we can use the hour or month data. Lat/lon can be plotted on a map. Non-numeric but still important are structured strings.
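As a minimal sketch of why types matter, here are hypothetical county rows (the numbers and dates are invented) where the number supports summing and the date exposes its month directly:

```python
from datetime import datetime

# Rows with typed attributes: a string, a number, and a date.
# County names, counts, and dates are all invented for illustration.
rows = [
    {"county": "Orange",  "tainted_homes": 12, "filed": datetime(2010, 3, 1)},
    {"county": "Broward", "tainted_homes": 30, "filed": datetime(2010, 4, 15)},
]

# Numbers can be summed or averaged...
total = sum(r["tainted_homes"] for r in rows)

# ...and dates expose parts like the month without any string parsing.
months = [r["filed"].month for r in rows]
```

None of these one-liners are possible if the same values live in an untyped text blob.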
Non-numeric but still important are structured strings. These are special because for any given thing like florida, there’s only one way to spell it.
This is important because something like Florida could be spelled in numerous ways, and the computer doesn't know how to reconcile the differences. If we wanted the total number of tainted walls in Florida, we would end up with a separate count for each spelling instead of one total.
Getting a program to extract Florida in a single unambiguous way is generally pretty hard, but it's important.
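One common approach, sketched here with an invented alias table, is to normalize each raw string to a canonical spelling before counting:

```python
import re

# Map known spelling variants to one canonical form. The alias table is
# illustrative, not exhaustive; a real pipeline would cover all states
# and their common abbreviations.
ALIASES = {"fl": "Florida", "fla": "Florida", "florida": "Florida"}

def canonical_state(raw):
    # Lowercase and strip everything but letters, then look up the alias.
    key = re.sub(r"[^a-z]", "", raw.lower())
    return ALIASES.get(key, raw.strip().title())

variants = ["Florida", "FLORIDA", " florida ", "Fla.", "FL"]
```

After normalization, grouping or counting by state treats every variant as the same value.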
Finally, they should be consistent, in the sense that each row in your table, or each document in your dataset, contains these attributes. Sometimes your structured data may not be in this kind of tabular format, but rather data attached to individual documents.
Hopefully I've convinced you that structured data is a good idea. Now I want to describe how structured data relates to unstructured data…
Specifically, that data isn't unstructured or structured. It all lies on a continuum. I want to give you examples that span this spectrum and what data we may want out of them.
The name of the game is moving towards the right,
Concretely, let’s say we have a bunch of tweets and we want to understand how the weather reported by the twitterverse differs across geographic areas.
We want to extract two pieces of structured data: Weather is a string containing "sunny", and Location is a string corresponding to the location. Or we could extract an even more specific data type
By using a geocoding app to turn the string "texas" into latitude/longitude coordinates.
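The tweet example might be sketched like this; the regex, the tiny gazetteer, and its coordinates are all illustrative assumptions standing in for a real geocoding service:

```python
import re

# A toy gazetteer standing in for a geocoder; the coordinates are
# illustrative only, not real geocoder output.
GAZETTEER = {"texas": (31.0, -100.0)}

def extract(tweet):
    # Pull a weather word and a place name out of tweets shaped like
    # "It's <weather> in <place>". Real tweets are much messier.
    m = re.search(r"it's (\w+) in (\w+)", tweet.lower())
    if m is None:
        return None
    weather, place = m.groups()
    return {"weather": weather, "location": place,
            "latlon": GAZETTEER.get(place)}

row = extract("It's sunny in texas")
```

The output is exactly the structured row from the slide: a weather string, a location string, and an optional lat/lon pair.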
I've summarized the process into something that helps calm my nerves: when I have unstructured data, I ask what structure I need. Is it dollars? Addresses? That helps target my search for finding attributes with simple types.
I figure it would be nice to end with more examples.
http://www.wordle.net/create
Last year, the Globe produced a word cloud of Obama's State of the Union speech.
An attribute that represents a single word in the speech. Perhaps with the punctuation removed
So we would start from the speech text and
Construct this single attribute table
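Building that single-attribute word table can be as simple as splitting on non-letter characters; this sketch uses the opening line of the speech quoted above:

```python
import re

speech = ("Mr. Speaker, Mr. Vice President, members of Congress, "
          "distinguished guests, and fellow Americans:")

# One row per word, punctuation stripped: the single-attribute table
# a word cloud is built from.
words = re.findall(r"[A-Za-z']+", speech)
```

Feeding the full speech through the same line yields the word table behind the Globe's visualization.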
Twitter released this graphic of the number of tweets per second referencing bin laden when he was captured earlier last year.
In this case tweets already contain the information we want – time.
Per capita availability of boneless, trimmed meat
We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates. The nice part of this data is that it is often considered important, and can be found in a consistent location in the documents.
Another example is the Deadly Day in Baghdad visualization produced by Jacob Harris and others at the NYTimes, which depicts the distribution of deaths in Baghdad for a single day. The location of a circle is the lat/lon of where it happened; the size is how many people died.
This is an example of a WikiLeaks document the NYTimes had to work with.
KIA = killed in action. In this case, the NYTimes extracted the data by hand, and sometimes this may be necessary. But if the documents all looked like this (KIA at the top, WHERE:), it _may_ be possible to use pattern matching to extract this data.
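A sketch of what such pattern matching could look like, assuming an invented field layout rather than the actual wording of the leaked reports:

```python
import re

# A made-up report in a consistent template; the field names mimic the
# structure described above but are not the real document wording.
doc = """KIA
WHERE: 33.30 44.40
KILLED: 3"""

# One pattern per "FIELD: value" line turns the template into a dict.
fields = dict(re.findall(r"^(\w+):\s*(.+)$", doc, re.MULTILINE))
```

This only works because every report shares the template; one irregular document and the pattern silently misses fields, which is why hand extraction is sometimes unavoidable.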
Since much data about our lives is inexorably tied to where we live, we are often concerned with the regions in which we live. This visualization shows the number of households per 1,000 in regions throughout MA that have lived there for 3+ generations, as an indicator of commitment to the region.
We need to extract two pieces of info. Similar to the Iraq map, we need location information, but this time shapes of regions rather than single lat/lon coordinates.
In this case, we are starting with what looks like structured data, and further extracting information.
Person's name. Extracting this type of information is called entity extraction, where an entity may be a business name, famous person, etc. This is typically quite difficult, and requires an existing dataset of "important entities."
Finally, a popular analysis is to classify the unstructured documents, categorizing by topic or emotion. TwitInfo is a tool by marcua to analyze tweets about particular topics. One of its features is analyzing the sentiment of the tweets. Here are four example tweets from last year talking about the Christchurch earthquake. Blue = positive, red = negative. The pie chart shows that the tweets are overwhelmingly positive.
The structured data would then be happiness, and its type is a number between -1 and 1. There exist tools for specific types of analysis like sentiment or topic. However…
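For intuition only, here is a crude lexicon-based scorer clamped to [-1, 1]; real sentiment tools, including TwitInfo's, use far more sophisticated methods, and these word lists are invented:

```python
# Toy positive/negative word lists, invented for illustration.
POSITIVE = {"hope", "best", "safe", "love"}
NEGATIVE = {"terrible", "destroyed", "fear"}

def sentiment(text):
    # Count positive minus negative words, then clamp to [-1, 1] so the
    # output has the same shape as a real tool: one number per document.
    words = text.lower().split()
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return max(-1, min(1, score))
```

The point is the output shape: each unstructured document collapses to a single typed number you can average, chart, or slice.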
Be really careful with these types of automatic categorization tools
In all of the examples until that last one, what we've talked about amounted to pattern matching. This is really good: there are tons of tools that do a good job.
For example, the extracted sentiment of tweets about the New Zealand earthquake was really positive! This is surprising because earthquakes are generally considered not so good. It's because the tweets are all wishing the survivors the best, but these extractors don't understand that.
You can give your pile of documents to a thousand people who will extract the data you want quickly and cheaply. MTurk and CrowdFlower have more of an "anonymous workers" approach, where someone will do your work but you don't know who; oDesk is more like directly hiring a contractor. In both cases, you'll need to train the workers and deal with quality issues.
If you have a bunch of the same forms, handwritten or not, Captricity is a new startup that will take your forms, extract the parts you care about, and return a nice, structured table containing the data.
If you are looking for people or places, OpenCalais is a tool that automatically finds entities, e.g. that Mario Monti is Prime Minister of Italy.
But I’m going to give you a tip sheet later that also contains this and the other tools.
Just say the text!
Number of users, number of posts per day. Major posts that have been censored
Thankfully the journalism and media studies program
Shorter: Bo Xilai falls from power.
We extract information such as the ip address of the post, the post contents, the post date, the deletion date, the poster, and other information.
The most difficult is completely unstructured data. For example, handwritten letters, where we want the sender and recipient names.
Or a scanned typewritten letter, where we want company and date information.
Or text files like the ProPublica example, where we want state and address data.
A non-text example would be scanned forms. In this case, federal election contribution reports, where we want the committee name and donation amounts and dates.
Going towards the structured end, there is data that smells unstructured but actually contains some structured data. For example, a tweet I wrote about trends in the database community contains more than just the text.
In addition to the tweet text, which is unstructured, the Twitter API provides structured information: the timestamp of when the tweet was posted, my username, the number of retweets, etc. These are all valuable to analyze without needing to process the actual text.
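A sketch of reading those structured fields from a tweet payload; the JSON shape is trimmed and simplified (the real API returns many more fields and uses a different date format), and the values are invented:

```python
import json
from datetime import datetime

# A trimmed-down, invented tweet payload; it only illustrates that the
# metadata arrives already structured, separate from the free text.
payload = """{"text": "db trends: ...",
              "user": {"screen_name": "sirrice"},
              "retweet_count": 4,
              "created_at": "2012-06-12T00:16:00"}"""

tweet = json.loads(payload)
# The timestamp is a typed value we can slice by month, hour, etc.
posted = datetime.fromisoformat(tweet["created_at"])
```

Everything here is usable for analysis without ever parsing the tweet text itself.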
Similarly, emails contain structured data in the form of….
Subject, date, sender, and tons more information. Later, Sudheendra will describe his email analysis tool that extracts specific pieces of structured data and visualizes them.
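Python's standard email module can pull those headers out directly; the Subject/From values below come from the email example earlier in the deck, while the full Date header and the body line are filled in for illustration:

```python
from email import message_from_string

# A minimal raw message. Subject and From match the slide's example;
# the Date header and body are invented to make the message complete.
raw = """Subject: Re: IRE conference in Boston
From: jaimi@ire.org
Date: Fri, 1 Jun 2012 15:08:00 -0400

See you there!"""

msg = message_from_string(raw)
# Headers parse into named attributes without touching the body text.
subject, sender = msg["Subject"], msg["From"]
```

The structured headers fall out for free; only the body remains unstructured text.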
Working directly with unstructured data is really, really hard. Oftentimes this requires manually analyzing documents one by one.
I want to convince you that you can do a lot without messing too much with actual unstructured data.
Hello, my name is Eugene Wu. I'm actually a student right across the river at MIT. I study databases. This isn't part of my PhD, but what I'm interested in is how reporters are dealing with and analyzing data.
When I was asked to talk about analyzing unstructured data for stories, I had a hard time coming up with a talk. This is a fairly open-ended topic, and I could talk about data scraping, visualization, or extraction. The reason there are so many techniques is that dealing with unstructured data is very difficult, and computers are terrible at it.
I also didn't want to talk about a single tool, because tools are often used for specific types of data and analyses, and I was looking for something useful for a general audience. Then I thought, hey, I'm a database student, and we work with tables all the time!
The best ones are numerical data types. Computers are really, really good at processing numerical values. They can easily show you the sum or average, or look for trends. In fact, pretty much every visualization tool and analysis program will expect numerical data.
If you can specify the type of numeric data, even better. For example, with lat/lon you can plot it on a map.
Next are structured strings. These are words where the meaning is different if the values are different; that is, there's one way to say Florida: capitalized "Florida." This is important when you want to ask "what's the total number of addresses in Florida?"
Finally, there is random text. This is very akin to saying "this attribute is unstructured text." Computers are horrible with this type of data because it's so ambiguous.