1. Oracle – Big Data
THE INTELLIGENCE LIFE-CYCLE
and Schema-Last Approach
Dr Neil Brittliff PhD
2. A little about myself…
Awarded a PhD at the University of Canberra in March this year for my work in the Big Data
space
Currently employed as a Data Scientist within the Australian Government
Have been employed by 5 law enforcement agencies
Developed Cryptographic Software to support the Australian Medicare System
First used Oracle products back in 1986
Worked in the IT industry since 1982
Resides in Canberra (capital of Australia)
Canberra is the only capital city in Australia that is not named after a person
Interests
Tennis (play) / Cricket (watch)
Bushwalking and camping
Piano Playing (very bad)
Making stuff out of wood
Enjoys the art of Programming (prefers the ‘C’ language)
Pushing the limits of the Raspberry Pi
3. Talk Structure
Motivation
Principles and Constraints
Intelligence Life-Cycle
Collect & Collate
Analyse & Produce
Report & Disseminate
Motivation
Research
What is a Schema
The Problem with ETL
Data Cleansing versus Data Triage
A New Architecture
Oracle Big Data
The Schema-Last Approach
Indexing Technologies and Exploitation
User Reaction
Observations and Opportunities
4. National Criminal Intelligence
The Law Enforcement community is also in the business of collecting and analysing criminal intelligence and data and, where possible, sharing the resulting information…
To do this, it needs rich, contemporary, and comprehensive criminal intelligence…
The National Criminal Intelligence Fusion Capability brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.
The Fusion capability identifies threats and vulnerabilities through the use of data.
It brings together, monitors and analyses data and information from Customs, other law
enforcement, Government agencies and industry to build an intelligence picture of serious and
organised crime in Australia.
5. Australian Institute of Criminology
• While many of the challenges posed by the volume of data are
addressed in part by new developments in technology, the
underlying issue has not been adequately resolved.
• Over many years, there have been a variety of different ideas put
forward in relation to addressing the increasing volume of data,
such as data mining.
Darren Quick and Kim-Kwang Raymond Choo
Australian Institute of Criminology
September 2014
6. Objectives
Support the Australian Intelligence Criminal Model
Simple Interface to exploit the data
Data ingestion must be simple to do and minimise transformation
Support the large variety of data sources
Fast ingestion and retrieval times
Enable exact and fuzzy searching
Support ‘Identity Resolution’
Support metadata
Maintain the data’s integrity
Preserve Data-Lineage/Provenance
Reproduce the ingested data source exactly!
We don’t want this!
7. The Intelligence Life-Cycle
Plan, prioritise & direct
Collect & collate
Analyse & produce
Report & disseminate
Evaluate & review
8. Intelligence – Data Source Classification
[Chart: data source classification – Low signal ≈ 95%, High signal ≈ 5%]
9. Some Definitions:
A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data.
Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The Twitter experience.
Schema is from the Greek word meaning ‘form’ or ‘figure’ and is a formal representation of a data model which has integrity constraints controlling permissible data values.
Data munging, sometimes referred to as data wrangling, means taking data that is stored in one format and changing it into another format.
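As a minimal, hypothetical illustration of munging, the sketch below reshapes records from one format (CSV) into another (JSON); the file and column names are invented for the example.

```python
# A minimal illustration of data munging: the same records are read from
# one format (CSV) and rewritten in another (JSON); the values themselves
# are untouched. File and column names are hypothetical.
import csv
import json

with open("rate_payers.csv", newline="") as f:
    rows = list(csv.DictReader(f))    # e.g. [{"name": "David Jones", "suburb": "Civic"}, ...]

with open("rate_payers.json", "w") as f:
    json.dump(rows, f, indent=2)      # same data, different representation
```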
11. Data Cleansing …
Data cleaning, also called data cleansing or scrubbing, deals with detecting and
removing errors and inconsistencies from data in order to improve the quality of
data.
“Data cleansing is the process of analysing the quality of data in a data source, manually
approving/rejecting the suggestions by the system, and thereby making changes to
the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted
process that analyses how data conforms to the knowledge in a knowledge base, and an
interactive process that enables the data steward to review and modify computer-assisted
process results to ensure that the data cleansing is exactly as they want to be done.”
Microsoft: 2012
13. Data Cleansing – Doesn’t Work
“Data cleansing can be time-consuming and tedious, but robust
estimators are not a substitute for careful examination of the data for
clerical errors and other problems. ”
David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a
new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.
“Formal data cleansing can easily overwhelm any human or perhaps
the computing capacity of an organization.”
N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings
of the RSPA, 2014.
“… that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.”
Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007.
14. Data Cleansing – Loss of Format

Input Date | Cleansed Date | Comment
20 July 2014 | 20-07-2014 | Australian date
July-20-2014 | 20-07-2014 | American format (mmm-dd-yyyy)
2014-20-07 | 20-07-2014 | Arabic format (right to left)
20-07-14 | 20-07-2014 | Data ambiguity
July 2014 | 01-07-2014 | Imputed value

“If you torture the data long enough, it will confess.”
Ronald Coase (Clifton R. Musser Professor of Economics)
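The “Data ambiguity” row can be made concrete with a small sketch: the same input string parses successfully under two different format assumptions, and nothing in the data says which is correct.

```python
# Sketch of the ambiguity in the table above: "20-07-14" is a valid date
# under more than one format, so a cleansing rule must guess.
from datetime import datetime

raw = "20-07-14"
as_dmy = datetime.strptime(raw, "%d-%m-%y")   # 2014-07-20 (Australian reading)
as_ymd = datetime.strptime(raw, "%y-%m-%d")   # 2020-07-14 (year-first reading)
print(as_dmy.date(), as_ymd.date())           # both parses succeed
```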
15. ETL vs Triage
ETL flow: Initiate → Extract → Determine suitability? → Transform → Assessment? → Load → Report → Complete
Triage flow: Initiate → Triage → Load → Suitability? → Application → Verify? → Fuse → Resolve → Complete
18. Data Storage/Collation
Store the data semantically
Built on a defined taxonomy/ontology
Perfect to capture metadata
Searched for the perfect triple store
Triple: Subject – Predicate – Object (triples compose into graphs and lists)
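A minimal sketch of the idea with the open-source rdflib library; the namespace and property names are invented for illustration and are not the agency's taxonomy.

```python
# A minimal sketch of semantic storage as subject-predicate-object triples,
# using rdflib. Namespace and terms are hypothetical.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/intel#")
g = Graph()
g.add((EX.record42, EX.fullName, Literal("David Jones")))
g.add((EX.record42, EX.suburb, Literal("Civic")))

# Retrieval is a SPARQL query over the graph.
for row in g.query("SELECT ?s ?name WHERE { ?s <http://example.org/intel#fullName> ?name }"):
    print(row.s, row.name)
```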
19. The Architecture
[Architecture diagram. Collect & Collate: feeds of new and historical data flow into a set store (HBase) and a semantic store with RDF/modelling, indexed by IIR and SOLR. Analyse & Produce: data exploration and exploitation via a search assistant, SPARQL, the R language and Apache Pig on the BDA. Disseminate: Palantir.]
20. Schema Last …
[Diagram – ‘triaged’ data columns mapped by the schema onto domains used by the models:
First Name, Middle Name, Last Name → Full-Name
Street Number, Street Name, Suburb, State, Postcode → Full-Address]
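A toy sketch of the mapping above, assuming the triaged rows are held untouched and the schema is only applied at exploitation time; the mapping mechanism shown here is illustrative, not the system's actual modelling tool.

```python
# Schema-last in miniature: the triaged row is stored as ingested, and a
# schema (mapping raw columns onto the Full-Name and Full-Address domains)
# is applied only when the data is exploited. All values are invented.
triaged_row = {
    "First Name": "David", "Middle Name": "R", "Last Name": "Jones",
    "Street Number": "1", "Street Name": "London Circuit",
    "Suburb": "Civic", "State": "ACT", "Postcode": "2601",
}

schema = {
    "Full-Name": ["First Name", "Middle Name", "Last Name"],
    "Full-Address": ["Street Number", "Street Name", "Suburb", "State", "Postcode"],
}

def apply_schema(row, schema):
    # Concatenate the raw columns behind each domain; the raw row is never modified.
    return {domain: " ".join(row[c] for c in cols if row.get(c))
            for domain, cols in schema.items()}

print(apply_schema(triaged_row, schema))
# {'Full-Name': 'David R Jones', 'Full-Address': '1 London Circuit Civic ACT 2601'}
```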
21. ACC Search Engines – ‘Smackdown’

Feature | SOLR | IIR
License | Apache License | Commercial
Storage | Inverted list | Third-party database
Google-like search support | | Next release
Score model | Inverse document frequency | Normalized score
Result pagination | |
Homophone support | Via synonym support |
Phoneme search | |
Spread indexes across multiple nodes | |
Schema-less support | |
Programming interface | REST | SOAP API
Geo-spatial | |
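For a flavour of the SOLR side, a hypothetical query through the third-party pysolr client is sketched below; the core URL and field names are assumptions. IIR, by contrast, is driven through its SOAP API.

```python
# A sketch of querying a SOLR index from Python with the third-party pysolr
# client; the core URL and field names are hypothetical.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/fusion", timeout=10)
results = solr.search("full_name:jones~", rows=10)   # '~' requests a fuzzy match
for doc in results:
    print(doc.get("id"), doc.get("full_name"))
```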
25. User Reaction
[Chart: time to triage (under 1 hour vs between 1 and 24 hours) against general size (%, megabytes)]
• Developed a Palantir plugin to search the Fusion Data Holding
• Bulk matching was a great success
• In general, user reaction has been positive
• Time to triage was usually under an hour, where cleansing could take weeks!
27. Observations…
The Bulk Matcher
Performance and Reliability
Interaction with Palantir
Configuration over Customisation
Search for the ‘Single Source of Truth’
Golden Record
Acceptance of the Schema Last Approach
Overwhelmed by Search Results
28. Further Reading and Contacts
Strategic Thinking in Criminal Intelligence
Jerry H Ratcliffe
The Federation Press – 2009
ISBN 978-1-86287-734-4
Intelligence-Led Policing
Jerry Ratcliffe
Routledge – 2008
ISBN 978-1-84392-339-8
Data Matching
Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection
Peter Christen
Springer – 2012
ISBN 978-3-642-31163-5
Foundations of Semantic Web Technologies
Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph
CRC Press – 2010
ISBN 978-1-4200-9050-5
Big Data – A revolution that will transform how we live, work, and
think
Viktor Mayer-Schönberger and Kenneth Cukier
HMH – 2013
ISBN 978-0-544-00269-2
The Schema Last Approach to Data Fusion
Neil Brittliff and Dharmendra Sharma
AusDM 2014
A Triple Store Implementation to support Tabular Data
Neil Brittliff and Dharmendra Sharma
AusDM 2014
Australian Institute of Criminology
http://www.aic.gov.au
University of Canberra
http://www.Canberra.edu.au
Editor's Notes
Thanks
Vladimir Videnovic
Richard Foote
Vicky Faulkner
Dharmendra Sharma (my PhD supervisor)
Introduction
About myself – worked for 5 law enforcement agencies
The AICM (The Australian Intelligence Criminal Model)
These are the components we will focus on:
Collect & Collate
Analyse & Produce
Intelligence is an integral part of the ACC remit and is used to identify new criminals and monitor existing known targets. The intelligence cycle is the process of developing unrefined data from multiple data sources and then analysing the ’fused’ data sources. The ACC and many other law enforcement agencies see that Big Data enables them to collect, store and process data at an unprecedented rate that is only going to increase. An integral process of the intelligence cycle is the collection and processing of raw data. In addition, the scale, complexity and changing nature of intelligence data can make it impossible to stay in front without the aid of technology to collect, process and analyse big data.
The Australian Institute of Criminology is Australia's national research and knowledge centre on crime and justice.
The Skywhale is a hot air balloon designed by the sculptor Patricia Piccinini as part of a commission to mark the centenary of the city of Canberra. It was built by Cameron Balloons in Bristol, United Kingdom, and first flew in Australia in 2013. The balloon's design received a mixed response after it was publicly unveiled in May 2013.
The cost of the balloon and the arrangements under which it was funded also attracted criticism. The executive director of culture for the ACT Chief Minister’s directorate informed the media on 9 May that the balloon and its supporting website cost about $170,000. Documents released the next day showed that the total cost to the government of commissioning and operating The Skywhale over its lifespan will be $300,000, and the philanthropic Aranday Foundation will provide a further $50,000. Moreover, the balloon will remain the property of the Melbourne-based company Global Ballooning, and only one flight was scheduled for Canberra.
The intelligence life-cycle’s central focus is data and data exploitation. The life-cycle begins with the identification of possible data sources and the collection and collation of the data, followed by the analysis and application of models upon this data, the production and dissemination of situation reports and, finally, an evaluation and review of the entire intelligence life-cycle. However, the life-cycle, as will be shown, must deal with:
messy and noisy data.
structured, semi-structured, and unstructured data.
tabular and highly linked data.
Cross-Industry Standard Process for Data Mining (CRISP–DM)
The Cross-Industry Standard Process for Data Mining (CRISP–DM). Under CRISP–DM, a given data mining project has a life cycle consisting of six phases. The next phase in the sequence often depends on the outcomes associated with the preceding phase. There may be a further data preparation phase for additional refinement before moving forward to the model evaluation phase. The six phases are as follows:
Business understanding phase.
The first phase in the CRISP–DM standard process may also be termed the research understanding phase.
Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole.
Translate these goals and restrictions into the formulation of a data mining problem
definition.
Prepare a preliminary strategy for achieving these objectives.
Data understanding phase
Collect the data.
Use exploratory data analysis to familiarize yourself with the data, and discover initial insights.
Low Signal – usually list data that has no criminal significance
High Signal – the opposite: list data that may be significant to an investigation
Variety was the real problem
A schema is used to describe relational (tabular), hierarchical or graph structures. Usually, a schema is used to identify how the data is to be stored or transported. For sources without a schema, such as files, there are few restrictions on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies. Database systems, on the other hand, enforce the restrictions of a specific data model (for example, the relational approach requires simple attribute values, referential integrity, et cetera) as well as application-specific integrity constraints.
Data munging, sometimes referred to as data wrangling, means taking data that is stored in one format and changing it into another format. Analysts regularly wrangle data into a form suitable for computational tools through a tedious process that delays more substantive analysis. There are tools, both interactive and command-line, that can assist data transformation, but analysts must still conceptualise the desired output state, formulate a transformation strategy, and specify complex transforms.
The ‘Schema-First’ approach may mean a loss of data quality at any one of these stages, reducing the data’s applicability. These include (Chapman, 2005):
• Data capture and recording at the time of gathering.
• Data manipulation prior to digitization (label preparation), identification of the collection and its recording.
• Digitization of the data.
• Documentation of the data (capturing and recording the meta-data).
• Data storage and archiving.
• Data presentation and dissemination (paper and electronic publications, web-enabled databases, et cetera).
• Data use (analysis and manipulation).
All of these have an input into the final quality or ‘fitness for use’ of the data, and all apply to all aspects of the data: the taxonomic or nomenclature portion (the ‘what’), the spatial portion (the ‘where’), and other data such as the ‘who’ and the temporal ‘when’.
Schema First
Requires a Cleansing step
Popular amongst data scientists
Analysis will happen only on the Cleansed Data
The organisation in question was utilising the Traditional Schema First Approach
Utilises ETL (Extract, Transform, Load) technologies
Cannot reproduce the original data exactly
Examples of Schemas are:
AVRO
ASN.1
XSD
Schema Last
Also known as schema on exploitation
Requires no Data Cleansing
Analysis can occur on the raw data
Found this to be a more effective model than Schema-First
A data cleaning approach should satisfy several requirements. First of all, it should detect and remove all major errors and inconsistencies both in individual data sources and when integrating multiple sources. The approach should be supported by tools to limit manual inspection and programming effort and be extensible to easily cover additional sources.
Furthermore, data cleaning should not be performed in isolation but together with schema related
data transformations based on comprehensive meta-data. Mapping functions for
data cleaning and other data transformations should be specified in a declarative way and be reusable for other data sources as well as for query processing. Especially for data warehouses, a work-flow infrastructure should be supported to execute all data transformation
steps for multiple sources and large data sets in a reliable and efficient way. As argued by David Ruppert, an esteemed member of the American Statistical Association, pertaining to inconsistent results in relation to statistical resampling (Ruppert, 2002):
“Data cleansing can be time-consuming and tedious, but robust estimators
are not a substitute for careful examination of the data for clerical errors and
other problems.”
The gap between data received and data cleansed was forever increasing
The management were not happy
This did not address operational or tactical intelligence
Human Cleansing
Often the data cleansing is a manual process where a human manually trawls through the data and corrects typographical errors or makes some determination of what the data represents. The data, for example, may be a list of rate payers for a capital city. The rate payer may be a household owner (owner occupier) or an organisation. The council does not distinguish (or probably care) whether the rate payer is an individual or an organisation as long as the rates are paid. This does, however, present a problem where the distinction matters for intelligence gathering: is it David Jones the person or David Jones the organisation? If this matters to the intelligence-gathering system, then further information, and in particular entity resolution, is required.
Automated Cleansing
Automated cleansing is whereby a set of automated rules is applied to the data and can modify, merge or split the data into a format suitable for ingestion. Regular expressions are suitable for matching or extracting parts contained within a string. A challenge with this approach within the Australian Crime Commission is that the coercive powers can dictate that an organisation is required to supply the agency with data, but cannot dictate the form or structure of that data. This poses a significant impost on any automated process; however, the majority of the data does come in the form of a comma-separated values (CSV) file, which is relatively simple to automate. Some outliers are more difficult to process and may be impossible to automate. This process or technology is referred to as Extract, Transform and Load (ETL).
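As a small, invented example of such a rule, the sketch below uses a regular expression to pull the parts out of one line of a hypothetical CSV feed.

```python
# The kind of rule an automated-cleansing step applies: a regular expression
# that extracts the parts of a line. The pattern and sample line are invented.
import re

line = "Jones, David, 1 London Circuit CIVIC ACT 2601"
pattern = re.compile(r"^(?P<last>[^,]+),\s*(?P<first>[^,]+),\s*(?P<address>.+)$")

match = pattern.match(line)
if match:    # rows the rule cannot match become outliers for manual triage
    print(match.group("last"), match.group("first"), match.group("address"))
```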
It is easy to demonstrate that loss of format can ultimately lead to loss of data, and that it is never a good idea to impute data if the data is missing. Other techniques, for example determining the most likely candidate based on statistical methods, should also be avoided.
ETL may alter the data
Triage keeps the raw data intact
Triage may require data reformatting but no data transformation
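A toy sketch of that contrast, under invented data and rules: the ETL path rewrites the record, while the triage path keeps the raw bytes intact and derives any reformatted view alongside them.

```python
# ETL rewrites the record; triage stores the raw record byte-for-byte and
# keeps any reformatting (here, decoding) separate from the preserved original.
def etl_load(raw: bytes) -> str:
    # Transformation step: the original formatting is lost.
    return raw.decode("latin-1").strip().upper()

def triage_load(raw: bytes) -> dict:
    # The raw record is retained exactly; a decoded view is derived, not substituted.
    return {"raw": raw, "view": raw.decode("latin-1")}

record = b"  20 July 2014, David Jones \r\n"
print(etl_load(record))            # '20 JULY 2014, DAVID JONES' - cannot reproduce the source
print(triage_load(record)["raw"])  # the original bytes survive intact
```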
It is a very messy world
It is only getting more complicated
Note:
Cloudera Path comes at no cost with BDA
Berkeley DB (NoSQL DB) has a cost
The RDF triple store allows the storage and retrieval of any data structure and is well suited to storing the schema-last artifacts. Therefore, the triple store can store both the data and the schema structure. Ideally, a triple store graph would only contain a specific store’s data and schema structure.
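A short, invented sketch of that point with rdflib: data triples and the triples describing the source's schema structure sit in the same graph. The vocabulary is hypothetical.

```python
# Keeping both data and schema structure in one triple store; all terms invented.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/intel#")
g = Graph()
# Data triples
g.add((EX.record42, EX.column1, Literal("David Jones")))
# Schema triples describing what column1 means in this source
g.add((EX.column1, EX.mapsToDomain, EX.FullName))
g.add((EX.column1, EX.sourceFile, Literal("rate_payers.csv")))
```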
Analytics Architecture.
It is not yet clear what an optimal architecture for an analytics system should look like to deal with historical data and with real-time data at the same time. An interesting proposal is the Lambda Architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general, extensible, allows ad hoc queries, minimal maintenance, and debuggable.
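A toy sketch of the query side of that idea, with plain dictionaries standing in for the Hadoop batch view and the Storm speed view.

```python
# Lambda architecture in miniature: the serving layer merges a precomputed
# batch view with an incremental speed view. The views here are plain dicts
# standing in for those layers; keys and counts are invented.
batch_view = {"david jones": 120}   # counts computed over all historical data
speed_view = {"david jones": 3}     # counts from records since the last batch run

def query(key: str) -> int:
    # Combining both views answers over all the data in real time.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("david jones"))  # 123
```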
Concept of a Table exists.
Each table has its own key space.
You can add and remove tables as easily as in an RDBMS.
Uses binary keys. It's common to combine many different items together to form a key.
Data Consistency by design.
Offers a nice convenience method to increment counters. Very much suitable for data aggregation.
Map Reduce support is native.
HBase is built on Hadoop.
Data does not get transferred.
Comparatively complicated, as it has many moving pieces such as ZooKeeper, Hadoop and HBase itself.
Comes with both Thrift and REST interfaces.
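Several of those points can be sketched with the third-party happybase client, which talks to HBase over the Thrift interface mentioned above; the table, key layout and column names are invented.

```python
# HBase points above in miniature, via happybase (Thrift). All names invented.
import happybase

connection = happybase.Connection("localhost")   # assumes a local Thrift gateway
table = connection.table("fusion_data")

# Binary keys: several items are commonly combined to form one key.
row_key = b"source01|2014-07-20|record42"
table.put(row_key, {b"raw:line": b"20 July 2014, David Jones"})

# The convenience counter method suits data aggregation.
table.counter_inc(b"source01", b"stats:records")

print(table.row(row_key))
```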
We developed a simple but effective modelling system to map the triaged data.
It was possible that the data was not in first normal form
The Schema mapped to domains
ESRI will only require the Address for geocoding
Informatica IIR was based on a product, SSA-NAME3, which was developed in Canberra.
SOLR 5 now provides schema-less support
The thin client (Bongo) allowed an analyst to map the data for further processing.
There is both a thin and fat client
Unlike other triple store implementations, Sesame provides several reference implementations but allows third parties to provide additional alternatives.
The Fusion Data Holding or FDH is a single repository created at the ACC to house the big data repository. There are a number of mandatory requirements that must be met for any design to succeed, which are:
1. Data must not be modified.
2. If the data was ordered, then the order could matter and must be retained.
3. The provenance of data is important and must be maintained and not lost through the data life-cycle.
4. The data must be able to be annotated, and the annotations must likewise not be lost.
5. Data must be extracted in bulk quickly and efficiently.
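A toy sketch of how an ingest routine might honour requirements 1 to 5; every name here is illustrative, not the FDH's actual design.

```python
# Provenance-preserving ingest in miniature: the payload is stored unmodified
# and in order, with lineage and annotations held alongside rather than
# written into the data. All names and fields are invented.
import hashlib
from datetime import datetime, timezone

def ingest(payload: bytes, source: str, store: list) -> dict:
    entry = {
        "data": payload,                                   # 1. never modified
        "sequence": len(store),                            # 2. original order retained
        "provenance": {                                    # 3. lineage kept with the record
            "source": source,
            "received": datetime.now(timezone.utc).isoformat(),
            "sha256": hashlib.sha256(payload).hexdigest(),
        },
        "annotations": [],                                 # 4. annotations live beside the data
    }
    store.append(entry)                                    # 5. a flat list scans quickly in bulk
    return entry

store: list = []
ingest(b"20 July 2014, David Jones", "rate_payers.csv", store)
```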
Palantir was the delivery mechanism of choice for data dissemination.