Oracle – Big Data
THE INTELLIGENCE LIFE-CYCLE
and Schema-Last Approach
Dr Neil Brittliff PhD
A little about myself… 2
• Awarded a PhD at the University of Canberra in March this year for my work in the Big Data space
• Currently employed as a Data Scientist within the Australian Government
• Have been employed by 5 law enforcement agencies
• Developed cryptographic software to support the Australian Medicare system
• First used Oracle products back in 1986
• Worked in the IT industry since 1982
• Reside in Canberra (capital of Australia)
• Canberra is the only capital city in Australia that is not named after a person
• Interests
  - Tennis (play) / Cricket (watch)
  - Bushwalking and camping
  - Piano playing (very badly)
  - Making stuff out of wood
• Enjoy the art of programming (prefer the ‘C’ language)
• Pushing the limits of the Raspberry Pi
Talk Structure 3
• Motivation
• Principles and Constraints
• Intelligence Life-Cycle
• Collect & Collate
• Analyse & Produce
• Report & Disseminate
• Motivation
• Research
• What is a Schema
• The Problem with ETL
• Data Cleansing versus Data Triage
• A New Architecture
• Oracle Big Data
• The Schema-Last Approach
• Indexing Technologies and Exploitation
• User Reaction
• Observations and Opportunities
National Criminal Intelligence 4
• The Law Enforcement community are also in the business of collecting and analysing criminal intelligence and data and, where possible, sharing the resulting information…
• To do this, they need rich, contemporary, and comprehensive criminal intelligence…
• The National Criminal Intelligence Fusion Capability brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.
• The Fusion capability identifies threats and vulnerabilities through the use of data.
• It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.
Australian Institute of Criminology 5
• While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved.
• Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining.
Darren Quick and Kim-Kwang Raymond Choo
Australian Institute of Criminology
September 2014
Objectives 6
• Support the Australian Criminal Intelligence Model
• Simple interface to exploit the data
• Data ingestion must be simple to do and minimise transformation
• Support the large variety of data sources
• Fast ingestion and retrieval times
• Enable exact and fuzzy searching
• Support ‘Identity Resolution’
• Support metadata
• Maintain the data’s integrity
• Preserve Data-Lineage/Provenance
• Reproduce the ingested data source exactly!
We don’t want this!
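The objectives of preserving provenance and reproducing the ingested source exactly can be illustrated with a small sketch. It is not the implementation described in the talk: the `IngestRecord` type, the `ingest` and `reproduce` functions, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical record kept for every ingested source."""
    source_name: str
    raw_bytes: bytes   # stored untouched, with no transformation
    sha256: str        # fingerprint taken at ingestion time
    ingested_at: str   # ISO-8601 timestamp kept for lineage


def ingest(source_name: str, raw_bytes: bytes) -> IngestRecord:
    """Store the source as received and fingerprint it."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    stamp = datetime.now(timezone.utc).isoformat()
    return IngestRecord(source_name, raw_bytes, digest, stamp)


def reproduce(record: IngestRecord) -> bytes:
    """Return the original bytes, failing loudly if they no longer match."""
    if hashlib.sha256(record.raw_bytes).hexdigest() != record.sha256:
        raise ValueError(f"{record.source_name}: stored data has changed")
    return record.raw_bytes
```

Because nothing is rewritten on the way in, the stored bytes can be handed back and checked against the fingerprint at any later point in the life-cycle.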
The Intelligence Life-Cycle 7
Plan, prioritise & direct
Collect & collate
Analyse & produce
Report & disseminate
Evaluate & review
Intelligence – Data Source Classification 8
[Chart: Data Source Classification. Low: 95%; High: 5%. Phases shown: Collect & collate, Analyse & produce]
Some Definitions: 9
That a major problem for the data scientist is to flatten the bumps as a result of the heterogeneity of data.
Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The Twitter experience.
Collect&Collate
Schema is from the Greek word meaning ‘form’ or ‘figure’ and is a formal representation of a data model which has integrity constraints controlling permissible data values.
Data munging, sometimes referred to as data wrangling, means taking data that’s stored in one format and changing it into another format.
Analyse
AnalyseStorage
Schema Application 10
[Diagram: Schema First compared with Schema Last. Labels: Raw Data, Triage, Cleanse, Raw Data Storage, Schema]
Data Cleansing … 11
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.
“Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.”
Microsoft: 2012
Collect&Collate
Data Sources – Always Increasing 12
Gap
Collect&Collate
Data Cleansing - Doesn’t WORK 13
“Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems.”
David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002.
“Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.”
N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014.
“that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.”
Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007.
Collect&Collate
Data Cleansing – Loss of Format 14
Input Date      Cleansed Date   Comment
20 July 2014    20-07-2014      Australian Date
July-20-2014    20-07-2014      American Format (mmm-dd-yyyy)
2014-20-07      20-07-2014      Arabic Format (right to left)
20-07-14        20-07-2014      Data Ambiguity
July 2014       01-07-2014      Imputed Value
“If you torture the data long enough, it will confess.”
Clifton R. Musser
Collect&Collate
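The ambiguity and imputation rows in the table above can be reproduced with a short sketch. The list of candidate formats and the `interpretations` function are assumptions for illustration; the point is only that a single "cleansed" value hides the fact that some inputs parse in more than one way, or only by inventing a day.

```python
from datetime import datetime

# Candidate formats loosely based on the table above (an assumed, not exhaustive, list).
FORMATS = ["%d %B %Y", "%B-%d-%Y", "%Y-%d-%m", "%d-%m-%y", "%y-%m-%d", "%B %Y"]


def interpretations(raw: str) -> list[tuple[str, str]]:
    """Return every (format, ISO date) pair under which the raw value parses."""
    found = []
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt).date()
        except ValueError:
            continue  # this format does not fit the raw value
        found.append((fmt, parsed.isoformat()))
    return found


for raw in ["20 July 2014", "20-07-14", "July 2014"]:
    print(raw, "->", interpretations(raw))
```

"20-07-14" parses both as 20 July 2014 and as 14 July 2020, and "July 2014" only parses because `strptime` silently supplies day 1. Keeping the raw string alongside any interpretation avoids committing to either guess.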
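ETL vs Triage 15
ETL: Initiate -> Extract -> Determine Suitability? -> Transform -> Assessment? -> Load -> Report -> Complete
Triage: Initiate -> Triage -> Load -> Suitability? -> Application -> Verify? -> Fuse -> Resolve -> Complete
(Steps ending in “?” are decision points with an “n” branch.)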
Collect&Collate
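One way to read the triage path is that a source is assessed and then loaded as-is, with no transform step in between. The sketch below is a minimal illustration under that reading; the `triage` and `load` functions, the comma delimiter, and the "every row has the same width" suitability rule are all assumptions, not the checks used in the talk.

```python
import csv
import io


def triage(raw_text: str, delimiter: str = ",") -> dict:
    """Assess a delimited source without changing any of its values."""
    rows = list(csv.reader(io.StringIO(raw_text), delimiter=delimiter))
    widths = {len(row) for row in rows if row}  # ignore blank lines
    return {
        "rows": len(rows),
        "columns": max(widths) if widths else 0,
        "suitable": len(widths) == 1,  # assumed rule: consistent column count
    }


def load(raw_text: str, store: list) -> dict:
    """Store the source exactly as received if triage finds it suitable."""
    report = triage(raw_text)
    if report["suitable"]:
        store.append(raw_text)
    return report


store = []
print(load("name,dob\nSmith,20-07-14\nJones,July 2014\n", store))
print(load("name,dob\nSmith\n", store))
```

Note that the odd date values pass triage untouched: the assessment looks only at structure, and interpretation of the values is deferred.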
We did our research … 16
Oracle’s BDA (Big Data Appliance) 17
Collect&Collate
Data Storage/Collation 18
• Store the data semantically
• Built on a defined taxonomy/ontology
• Perfect for capturing metadata
• Searched for the perfect triple store
[Diagram: Triple (Subject, Predicate, Object); Graph; List]
Collect&Collate
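A subject-predicate-object triple can hold tabular data without fixing a schema in advance. The sketch below is an illustrative assumption rather than the triple store from the talk: the subject naming (`source#rowN`), the `row_to_triples` generator, and the `match` helper are hypothetical.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]


def row_to_triples(source: str, row_number: int, row: dict[str, str]) -> Iterator[Triple]:
    """Emit one triple per cell: subject = row identity, predicate = column name."""
    subject = f"{source}#row{row_number}"
    for column, value in row.items():
        yield (subject, column, value)


def match(triples: list[Triple], subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None) -> list[Triple]:
    """Return the triples that agree with every part of the pattern that is given."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


triples = list(row_to_triples("people.csv", 1, {"First Name": "Ann", "Suburb": "Acton"}))
print(match(triples, predicate="Suburb"))
```

Because the column name travels with each value as the predicate, a source with unexpected or extra columns needs no change to the storage layout.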
The Architecture 19
[Architecture diagram spanning Collect & Collate, Analyse & Produce, and Disseminate. Labels: Set Store, HBase, Historical Data, New Data, RDF/Modelling, Feeds, Data Exploration, Semantic Store, Index (IIR), Index (SOLR), BDA, Palantir, Search Assistant, Data Flow, Data Exploitation, SPARQL, R Language, Apache PIG]
Schema Last … 20
‘Triaged’ Data: First Name, Middle Name, Last Name -> Schema: Full-Name
‘Triaged’ Data: Street Number, Street Name, Suburb, State, Postcode -> Schema: Full-Address
Collect&Collate
Models
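The mapping on this slide, from triaged columns to schema fields such as Full-Name and Full-Address, can be sketched as a view computed when the data is read rather than when it is stored. The `SCHEMA` dictionary and `apply_schema` function are hypothetical names, and joining the parts with a single space is an assumption.

```python
# Hypothetical schema: each derived field lists the triaged columns it is built from.
SCHEMA = {
    "Full-Name": ["First Name", "Middle Name", "Last Name"],
    "Full-Address": ["Street Number", "Street Name", "Suburb", "State", "Postcode"],
}


def apply_schema(record: dict[str, str], schema: dict[str, list[str]] = SCHEMA) -> dict[str, str]:
    """Build the schema view of a triaged record; the record itself is not modified."""
    view = {}
    for field, parts in schema.items():
        # Skip missing or empty columns instead of inventing values for them.
        values = [record[p].strip() for p in parts if record.get(p, "").strip()]
        view[field] = " ".join(values)
    return view


raw = {
    "First Name": "Ann", "Middle Name": "", "Last Name": "Lee",
    "Street Number": "1", "Street Name": "Main St",
    "Suburb": "Acton", "State": "ACT", "Postcode": "2601",
}
print(apply_schema(raw))
```

Since the schema is only a description applied at read time, it can be revised or replaced without re-ingesting the stored data.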
ACC Search Engines – ‘Smackdown’ 21
Feature                               SOLR                        IIR
License                               Apache License              Commercial
Storage                               Inverted List               Third-party Database
Support Google Like search            ✓                           Next Release
Score Model                           Inverse Document Frequency  Normalized Score
Result Pagination                     ✓
Homophone Support                     Can use synonym support     ✓
Phoneme Search                        ✓                           ✓
Spread indexes across multiple nodes  ✓                           ✓
Schema-less Support                   ✓
Programming Interface                 REST                        SOAP - API
Geo-spatial                           ✓                           ✓
Collect&Collate
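The "Inverted List" storage and "Inverse Document Frequency" score model named in the table can be illustrated with a toy index. This is a bare IDF sum written for illustration, not Solr's actual scoring formula; the function names and the whitespace tokenisation are assumptions.

```python
import math
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Inverted list: each term maps to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def idf(term: str, index: dict[str, set[str]], total_docs: int) -> float:
    """Rare terms score high; a term found in every document scores zero."""
    df = len(index.get(term.lower(), ()))
    return math.log(total_docs / df) if df else 0.0


def search(query: str, index: dict[str, set[str]], total_docs: int) -> list[tuple[str, float]]:
    """Rank documents by the summed IDF of the query terms they contain."""
    scores = defaultdict(float)
    for term in query.lower().split():
        weight = idf(term, index, total_docs)
        for doc_id in index.get(term, ()):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {"a": "john smith canberra", "b": "john jones sydney", "c": "mary smith canberra"}
index = build_index(docs)
print(search("john jones", index, len(docs)))
```

Document "b" ranks first because it contains the rarer term "jones" as well as "john", which is the behaviour an IDF-based score model is meant to give.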
Collect & Collation Tool 22
Collect&Collate
Bongo – Exploration 23
Analyse&Produce
Palantir – Semantic Interface 24
Report&Disseminate
User Reaction 25
[Chart: Time to Triage. Categories: < 1 Hour; > 1 Hour < 24 Hour]
[Chart: General Size % - Megabytes. Category: < 1]
• Developed a Palantir plugin to search the Fusion Data Holding
• Bulk matching was a great success
• In general, user reaction has been positive
• Time to triage was usually under an hour, where cleansing could take weeks!
Ingestion Rate – The Improvement 26
Collect&Collate
Observations… 27
• The Bulk Matcher
• Performance and Reliability
• Interaction with Palantir
• Configuration over Customisation
• Search for the ‘Single Source of Truth’
• Golden Record
• Acceptance of the Schema Last Approach
• Overwhelmed by Search Results
Further Reading and Contacts 28
• Strategic Thinking in Criminal Intelligence
  Jerry H Ratcliffe
  The Federation Press – 2009
  ISBN 978-1-86287-734-4
• Intelligence-Led Policing
  Jerry Ratcliffe
  Routledge – 2008
  ISBN 978-1-84392-339-8
• Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection
  Peter Christen
  Springer – 2012
  ISBN 978-3-642-31163-5
• Foundations of Semantic Web Technologies
  Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph
  CRC Press – 2010
  ISBN 978-1-4200-9050-5
• Big Data – A revolution that will transform how we live, work, and think
  Viktor Mayer-Schönberger and Kenneth Cukier
  HMH – 2013
  ISBN 978-0-544-00269-2
• The Schema Last Approach to Data Fusion
  Neil Brittliff and Dharmendra Sharma
  AusDM 2014
• A Triple Store Implementation to support Tabular Data
  Neil Brittliff and Dharmendra Sharma
  AusDM 2014
Australian Institute of Criminology
http://www.aic.gov.au
University of Canberra
http://www.Canberra.edu.au

Sensual Moments: +91 9999965857 Independent Call Girls Vasundhara Delhi {{ Mo...
 
Indemnity Guarantee Section 124 125 and 126
Indemnity Guarantee Section 124 125 and 126Indemnity Guarantee Section 124 125 and 126
Indemnity Guarantee Section 124 125 and 126
 
Russian Call Girls Rohini Sector 7 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
Russian Call Girls Rohini Sector 7 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...Russian Call Girls Rohini Sector 7 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
Russian Call Girls Rohini Sector 7 💓 Delhi 9999965857 @Sabina Modi VVIP MODEL...
 
Old Income Tax Regime Vs New Income Tax Regime
Old  Income Tax Regime Vs  New Income Tax   RegimeOld  Income Tax Regime Vs  New Income Tax   Regime
Old Income Tax Regime Vs New Income Tax Regime
 
如何办理(UNK毕业证书)内布拉斯加大学卡尼尔分校毕业证学位证书
如何办理(UNK毕业证书)内布拉斯加大学卡尼尔分校毕业证学位证书如何办理(UNK毕业证书)内布拉斯加大学卡尼尔分校毕业证学位证书
如何办理(UNK毕业证书)内布拉斯加大学卡尼尔分校毕业证学位证书
 
如何办理新加坡南洋理工大学毕业证(本硕)NTU学位证书
如何办理新加坡南洋理工大学毕业证(本硕)NTU学位证书如何办理新加坡南洋理工大学毕业证(本硕)NTU学位证书
如何办理新加坡南洋理工大学毕业证(本硕)NTU学位证书
 
John Hustaix - The Legal Profession: A History
John Hustaix - The Legal Profession:  A HistoryJohn Hustaix - The Legal Profession:  A History
John Hustaix - The Legal Profession: A History
 
如何办理(GWU毕业证书)乔治华盛顿大学毕业证学位证书
如何办理(GWU毕业证书)乔治华盛顿大学毕业证学位证书如何办理(GWU毕业证书)乔治华盛顿大学毕业证学位证书
如何办理(GWU毕业证书)乔治华盛顿大学毕业证学位证书
 
VIETNAM – LATEST GUIDE TO CONTRACT MANUFACTURING AND TOLLING AGREEMENTS
VIETNAM – LATEST GUIDE TO CONTRACT MANUFACTURING AND TOLLING AGREEMENTSVIETNAM – LATEST GUIDE TO CONTRACT MANUFACTURING AND TOLLING AGREEMENTS
VIETNAM – LATEST GUIDE TO CONTRACT MANUFACTURING AND TOLLING AGREEMENTS
 
Key Factors That Influence Property Tax Rates
Key Factors That Influence Property Tax RatesKey Factors That Influence Property Tax Rates
Key Factors That Influence Property Tax Rates
 
Cleades Robinson's Commitment to Service
Cleades Robinson's Commitment to ServiceCleades Robinson's Commitment to Service
Cleades Robinson's Commitment to Service
 
如何办理(MSU文凭证书)密歇根州立大学毕业证学位证书
 如何办理(MSU文凭证书)密歇根州立大学毕业证学位证书 如何办理(MSU文凭证书)密歇根州立大学毕业证学位证书
如何办理(MSU文凭证书)密歇根州立大学毕业证学位证书
 
Constitutional Values & Fundamental Principles of the ConstitutionPPT.pptx
Constitutional Values & Fundamental Principles of the ConstitutionPPT.pptxConstitutional Values & Fundamental Principles of the ConstitutionPPT.pptx
Constitutional Values & Fundamental Principles of the ConstitutionPPT.pptx
 
如何办理伦敦南岸大学毕业证(本硕)LSBU学位证书
如何办理伦敦南岸大学毕业证(本硕)LSBU学位证书如何办理伦敦南岸大学毕业证(本硕)LSBU学位证书
如何办理伦敦南岸大学毕业证(本硕)LSBU学位证书
 
A Short-ppt on new gst laws in india.pptx
A Short-ppt on new gst laws in india.pptxA Short-ppt on new gst laws in india.pptx
A Short-ppt on new gst laws in india.pptx
 
如何办理(UCD毕业证书)加州大学戴维斯分校毕业证学位证书
如何办理(UCD毕业证书)加州大学戴维斯分校毕业证学位证书如何办理(UCD毕业证书)加州大学戴维斯分校毕业证学位证书
如何办理(UCD毕业证书)加州大学戴维斯分校毕业证学位证书
 
Legal Alert - Vietnam - First draft Decree on mechanisms and policies to enco...
Legal Alert - Vietnam - First draft Decree on mechanisms and policies to enco...Legal Alert - Vietnam - First draft Decree on mechanisms and policies to enco...
Legal Alert - Vietnam - First draft Decree on mechanisms and policies to enco...
 
如何办理(SFSta文凭证书)美国旧金山州立大学毕业证学位证书
如何办理(SFSta文凭证书)美国旧金山州立大学毕业证学位证书如何办理(SFSta文凭证书)美国旧金山州立大学毕业证学位证书
如何办理(SFSta文凭证书)美国旧金山州立大学毕业证学位证书
 
Trial Tilak t 1897,1909, and 1916 sedition
Trial Tilak t 1897,1909, and 1916 seditionTrial Tilak t 1897,1909, and 1916 sedition
Trial Tilak t 1897,1909, and 1916 sedition
 

Oracle openworld-presentation

  • 1. Oracle – Big Data THE INTELLIGENCE LIFE-CYCLE and Schema-Last Approach Dr Neil Brittliff PhD
  • 2. A little about myself…  Awarded a PhD at the University of Canberra in March this year for my work in the Big Data space  Currently employed as Data Scientist within the Australian Government  Have been employed by 5 law enforcement agencies  Developed Cryptographic Software to support the Australian Medicare System  First used Oracle products back in 1986  Worked in the IT industry since 1982  Resides in Canberra (capital of Australia)  Canberra is the only capital city in Australia that is not named after a person  Interests  Tennis (play) / Cricket (watch)  Bushwalking and camping  Piano Playing (very bad)  Making stuff out of wood  Enjoys the art of Programming (prefers the ‘C’ language)  Pushing the limits of the Raspberry Pi 2
  • 3. Talk Structure 3  Motivation  Principles and Constraints  Intelligence Life-Cycle  Collect & Collate  Analyse & Produce  Report & Disseminate  Motivation  Research  What is a Schema  The Problem with ETL  Data Cleansing versus Data Triage  A New Architecture  Oracle Big Data  The Schema-Last Approach  Indexing Technologies and Exploitation  User Reaction  Observations and Opportunities
  • 4. National Criminal Intelligence 4  The Law Enforcement community are also in the business of collecting and analysing criminal intelligence and data, and where possible, sharing that resulting information…  To do this, they need rich, contemporary, and comprehensive criminal intelligence…  The National Criminal Intelligence Fusion Capability, which brings together subject matter experts, analysts, technology and big data to identify previously unknown criminal entities, criminal methodologies, and patterns of crime.  Fusion capability identifies the threats and vulnerabilities through the use of data.  It brings together, monitors and analyses data and information from Customs, other law enforcement, Government agencies and industry to build an intelligence picture of serious and organised crime in Australia.
  • 5. Australian Institute of Criminology 5 • While many of the challenges posed by the volume of data are addressed in part by new developments in technology, the underlying issue has not been adequately resolved. • Over many years, there have been a variety of different ideas put forward in relation to addressing the increasing volume of data, such as data mining. Darren Quick and Kim-Kwang Raymond Choo Australian Institute of Criminology September 2014
  • 6. Objectives 6  Support the Australian Intelligence Criminal Model  Simple Interface to exploit the data  Data ingestion must be simple to do  and minimise transformation  Support the large variety of data sources  Fast ingestion and retrieval times  Enable exact and fuzzy searching  Support ‘Identity Resolution’  Support metadata  Maintain the data’s integrity  Preserve Data-Lineage/Provenance  Reproduce the ingested data source exactly! We don’t want this!
  • 7. The Intelligence Life-Cycle 7 Plan, prioritise & direct Collect & collate Report & disseminate Analyse & produce Evaluate & review
  • 8. Intelligence – Data Source Classification 8 Low 95% High 5% DATA SOURCE CLASSIFICATION Low High Collect&collateAnalyse&produce
  • 9. Some Definitions: 9 A major problem for the data scientist is to flatten the bumps that result from the heterogeneity of data. Jimmy Lin and Dmitriy Ryaboy. Scaling big data mining infrastructure: The twitter experience. Collect&Collate Schema is from the Greek word meaning ‘form’ or ‘figure’ and is a formal representation of a data model which has integrity constraints controlling permissible data values. Data munging, sometimes referred to as data wrangling, means taking data that’s stored in one format and changing it into another format.
  • 10. Analyse AnalyseStorage Schema Application 10 SchemaFirst Raw Data Triage Cleanse Raw Data Storage SchemaLast Schema Schema
  • 11. Data Cleansing … 11 Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. “Data cleansing is the process of analysing the quality of data in a data source, manually approving/rejecting the suggestions by the system, and thereby making changes to the data. Data cleansing in Data Quality Services (DQS) includes a computer-assisted process that analyses how data conforms to the knowledge in a knowledge base, and an interactive process that enables the data steward to review and modify computer-assisted process results to ensure that the data cleansing is exactly as they want to be done.” Microsoft: 2012 Collect&Collate
  • 12. Data Sources – Always Increasing 12 Gap Collect&Collate
  • 13. Data Cleansing - Doesn’t WORK 13 “Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems. ” David Ruppert. Inconsistency of resampling algorithms for high-breakdown regression estimators and a new algorithm. Journal of the American Statistical Association, 97: 148–149, 2002. “Formal data cleansing can easily overwhelm any human or perhaps the computing capacity of an organization.” N. Brierley, T. Tippetts, and P. Cawley. Data fusion for automated non-destructive inspection. Proceedings of the RSPA, 2014. “that the data volume may overwhelm the Extract Transform Load process and that data cleansing may introduce unintentional errors.” Vincent McBurney, 17 mistakes that ETL designers make with very large data, 2007. Collect&Collate
  • 14. Data Cleansing – Loss of Format 14
      Input Date    | Cleansed Date | Comment
      20 July 2014  | 20-07-2014    | Australian Date
      July-20-2014  | 20-07-2014    | American Format (mmm-dd-yyyy)
      2014-20-07    | 20-07-2014    | Arabic Format (right to left)
      20-07-14      | 20-07-2014    | Data Ambiguity
      July 2014     | 01-07-2014    | Imputed Value
      “If you torture the data long enough, it will confess.” Clifton R. Musser Collect&Collate
  • 15. ETL vs Triage 15 Initiate Extract Determine Suitability? Transform n Assessment? Load Report Complete n Initiate Triage Load Suitability? Application n Verify? Fuse Resolve Complete n Collect&Collate
  • 16. We did our research … 16
  • 17. Oracle’s BDA (Big Data Appliance) 17 Collect&Collate
  • 18. Data Storage/Collation 18  Store the Data Semantically  Built on a defined taxonomy/ontology  Perfect to capture metadata  Searched for the perfect Triple-Store Subject Predicate Object Triple Graph List Collect&Collate
  • 19. The Architecture 19 Collect & Collate Analyse & Produce Set Store Hbase Historical Data New Data RDF/Modelling Feeds DataExploration SemanticStore Disseminate Index IIR Index SOLR BDA Palantir SearchAssistant Data Flow DataExploitation SPARQL R Language Apache PIG
  • 20. Schema Last … 20 ‘Triaged’ Data First Name Middle Name Last Name Schema Full-Name Street Number Street Name Suburb State Postcode Full-Address Collect&Collate Models
  • 21. ACC Search Engines – ‘Smackdown’ 21 Feature SOLR IIR License Apache License Commercial Storage Inverted List Third-party Database Support Google Like search  Next Release Score Model Inverse Document Frequency Normalized Score Result Pagination  Homophone Support Can use synonym support  Phoneme Search   Spread indexes across multiple nodes   Schema-less Support  Programming Interface Rest SOAP - API Geo-spatial   Collect&Collate
  • 22. Collect & Collation Tool 22 Collect&Collate
  • 23. Bongo – Exploration 23 Analyse&Produce
  • 24. Palantir – Semantic Interface 24 Report&Disseminate
  • 25. User Reaction 25 Time to Triage < 1 Hour > 1 Hour < 24 Hour General Size % - Megabytes < 1 • Developed a Palantir Plugin to search the Fusion Data Holding • Bulk Matching was a great success • In general, user reaction has been positive • Time to Triage was usually under an hour where cleansing could take weeks!!!
  • 26. Ingestion Rate – The Improvement 26 Collect&Collate
  • 27. Observations… 27  The Bulk Matcher  Performance and Reliability  Interaction with Palantir  Configuration over Customisation  Search for the ‘Single Source of Truth’  Golden Record  Acceptance of the Schema Last Approach  Overwhelmed by Search Results
  • 28. Further Reading and Contacts 28  Strategic Thinking in Criminal Intelligence Jerry H Ratcliffe The Federation Press – 2009 ISBN 978 186287 734-4  Intelligence-Led Policing Jerry Ratcliffe Routledge – 2008 ISBN 978-1-843292-339-8  Data Matching Concepts and Techniques and Record Linkage, Entity Resolution, and Duplicate Detection Peter Christen Springer – 2012 ISBN 978-3-642-31163-5  Foundations of Semantic Web Technologies Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph CRC Press – 2010 ISBN 978-1-4200-9050-5  Big Data – A revolution that will transform how we live, work, and think Viktor Mayer-Schönberger and Kenneth Cukier HMH – 2013 ISBN 978-0-544-00269-2  The Schema Last Approach to Data Fusion Neil Brittliff and Dharmendra Sharma AusDM 2014  A Triple Store Implementation to support Tabular Data Neil Brittliff and Dharmendra Sharma AusDM 2014 Australian Institute of Criminology http://www.aic.gov.au University of Canberra http://www.Canberra.edu.au
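
The loss of format shown on slide 14 can be sketched in a few lines of Python. This is an illustrative sketch only, not the tooling described in the talk: the pattern list, the `cleanse` and `triage` function names and the record layout are assumptions made for the example. It contrasts a cleansing step that keeps only the normalised date with a triage step that keeps the raw value and records the interpretation beside it.

```python
from datetime import datetime

# Hypothetical parsing rules (strptime pattern, description); assumed for illustration.
PATTERNS = [
    ("%d %B %Y", "day month-name year"),
    ("%B-%d-%Y", "month-name-day-year"),
    ("%Y-%d-%m", "year-day-month"),
    ("%d-%m-%y", "two-digit year"),
    ("%B %Y", "month-name year, day missing"),
]


def cleanse(raw):
    """Schema-first style: return only the normalised dd-mm-yyyy string."""
    for pattern, _ in PATTERNS:
        try:
            return datetime.strptime(raw, pattern).strftime("%d-%m-%Y")
        except ValueError:
            continue
    return None


def triage(raw):
    """Schema-last style: keep the raw value and note how it was interpreted."""
    for pattern, description in PATTERNS:
        try:
            parsed = datetime.strptime(raw, pattern)
        except ValueError:
            continue
        return {
            "raw": raw,  # original text is never overwritten
            "interpreted": parsed.strftime("%d-%m-%Y"),
            "format": description,
        }
    return {"raw": raw, "interpreted": None, "format": None}


if __name__ == "__main__":
    for value in ["20 July 2014", "July-20-2014", "2014-20-07", "20-07-14", "July 2014"]:
        print(cleanse(value), triage(value))
```

After `cleanse`, the first four inputs are indistinguishable and the missing day in the last input has silently become the first of the month. The `triage` record still carries the original string, so the ingested source can be reproduced exactly.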

Editor's Notes

  1. Thanks Vladimir Videnovic Richard Foote Vicky Faulkner Dharmendra Sharma (my PhD supervisor) Introduction About myself – worked for 5 law enforcement agencies
  2. The AICM (The Australian Intelligence Criminal Model). These are the components we will focus on: Collect & Collate, Analyse & Produce
  3. Intelligence is an integral part of the ACC remit and is used to identify new criminals and monitor existing known targets. The intelligence cycle is the process of developing unrefined data from multiple data sources and then analysing the ‘fused’ data sources. The ACC and many other law enforcement agencies see that Big Data enables data to be collected, stored and processed at an unprecedented rate that is only going to increase. An integral process of the intelligence cycle is the collection and processing of raw data. In addition, the scale, complexity and changing nature of intelligence data can make it impossible to stay in front without the aid of technology to collect, process and analyse big data.
  4. The Australian Institute of Criminology is Australia's national research and knowledge centre on crime and justice.
  5. The Skywhale is a hot air balloon designed by the sculptor Patricia Piccinini as part of a commission to mark the centenary of the city of Canberra. It was built by Cameron Balloons in Bristol, United Kingdom, and first flew in Australia in 2013. The balloon's design received a mixed response after it was publicly unveiled in May 2013. The cost of the balloon and the arrangements under which it was funded also attracted criticism. The executive director of culture for the ACT Chief Minister’s directorate informed the media on 9 May that the balloon and its supporting website cost about $170,000. Documents released the next day showed that the total cost to the government of commissioning and operating The Skywhale over its lifespan will be $300,000, and the philanthropic Aranday Foundation will provide a further $50,000. Moreover, the balloon will remain the property of the Melbourne-based company Global Ballooning and only one flight was scheduled for Canberra 
  6. The central focus of the intelligence life-cycle is data and data exploitation. The life-cycle begins with the identification of possible data sources and the collection and collation of the data, followed by the analysis and application of models upon this data, the production and dissemination of situation reports, and finally an evaluation and review of the entire intelligence life-cycle. However, as will be shown, the life-cycle must deal with: messy and noisy data; structured, semi-structured, and unstructured data; tabular and highly linked data. The Cross-Industry Standard Process for Data Mining (CRISP–DM): under CRISP–DM, a given data mining project has a life cycle consisting of six phases; the next phase in the sequence often depends on the outcomes associated with the preceding phase. There may be a further data preparation phase for refinement before moving forward to the model evaluation phase. The first two of the six phases are as follows. Business understanding phase: the first phase in the CRISP–DM standard process may also be termed the research understanding phase. Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole. Translate these goals and restrictions into the formulation of a data mining problem definition. Prepare a preliminary strategy for achieving these objectives. Data understanding phase: collect the data. Use exploratory data analysis to familiarize yourself with the data, and discover initial insights.
  7. Low Signal – Usually List data that has no criminal significance High Signal – The opposite list that may be significant to an investigation Variety was the real problem
  8. Schema is used to describe relational tabular, hierarchical or graph structures. Usually, schema is used to identify how the data is to be stored or transported. For sources without schema, such as files, there are few restrictions on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies. Database systems, on the other hand, enforce restrictions of a specific data model (for example: the relational approach requires simple attribute values, referential integrity, et cetera) as well as application-specific integrity constraints. Data munging, sometimes referred to as data wrangling, means taking data that’s stored in one format and changing it into another format. Analysts regularly wrangle data into a form suitable for computational tools through a tedious process that delays more substantive analysis. There are tools, both interactive and command line, that can assist data transformation, but analysts must still conceptualize the desired output state, formulate a transformation strategy, and specify complex transforms. The ‘Schema-First’ approach may mean a loss of data quality at any one of the following stages and so reduce the data's applicability. These include (Chapman, 2005): • Data capture and recording at the time of gathering. • Data manipulation prior to digitization (label preparation), identification of the collection and its recording. • Digitization of the data. • Documentation of the data (capturing and recording the meta-data). • Data storage and archiving. • Data presentation and dissemination (paper and electronic publications, web-enabled databases, et cetera). • Data use (analysis and manipulation). All these have an input into the final quality or ‘fitness for use’ of the data and all apply to all aspects of the data: the taxonomic or nomenclature portion of the data (the ‘what’), the spatial portion (the ‘where’) and other data such as the ‘who’ and the temporal ‘when’.
  9. Schema First: Requires a cleansing step. Popular amongst data scientists. Analysis will happen only on the cleansed data. The organisation in question was utilising the traditional Schema First approach. Utilises ETL (Extract, Transform, Load) technologies. Cannot reproduce the data exactly. Examples of schemas are: AVRO, ASN.1, XSD. Schema Last: Also known as schema on exploitation. Requires no data cleansing. Analysis can occur on the raw data. Found this to be a more effective model than Schema-First.
  10. A data cleaning approach should satisfy several requirements. First of all, it should detect and remove all major errors and inconsistencies both in individual data sources and when integrating multiple sources. The approach should be supported by tools to limit manual inspection and programming effort and be extensible to easily cover additional sources. Furthermore, data cleaning should not be performed in isolation but together with schema related data transformations based on comprehensive meta-data. Mapping functions for data cleaning and other data transformations should be specified in a declarative way and be reusable for other data sources as well as for query processing. Especially for data warehouses, a work-flow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way. As argued by David Ruppert, an esteemed member of the American Statistical Association, in relation to inconsistent results in statistical sampling (Ruppert, 2002): “Data cleansing can be time-consuming and tedious, but robust estimators are not a substitute for careful examination of the data for clerical errors and other problems.”
  11. The Gap between was forever increasing The management were not happy This did not address operational or tactical intelligence
  12. Human Cleansing: Often data cleansing is a manual process in which a person trawls through the data and corrects typographical or other errors, or makes some determination of what the data represents. The data may, for example, be a list of rate payers for a capital city. The rate payer may be a household owner (owner-occupier) or an organization. The council does not distinguish (or probably care) whether the rate payer is an individual or an organization as long as the rates are paid. This does, however, present a problem where the distinction matters for intelligence gathering: is it David Jones the person or David Jones the organization? If this does matter to the intelligence gathering system, then further information, or in particular entity resolution, is required. Automated Cleansing: Automated cleansing is where a set of automated rules is applied to the data to modify, merge or split it into a format suitable for ingestion. Regular expressions are suitable for matching or extracting parts of a string. A challenge with this approach within the Australian Crime Commission is that its coercive powers can require an organization to supply data to the agency but cannot dictate the form or structure of that data. This poses a significant impost on any automated process; however, the majority of the data does come in the form of a comma-separated file (CSV), which is relatively simple to automate. Some outliers are more difficult to process and may be impossible to automate. This process or technology is referred to as Extract, Transform and Load (ETL).
  13. It is easy to demonstrate that loss of format can ultimately lead to loss of data and that it is never a good idea to impute data if the data is missing. Other techniques, for example determining the most likely candidate based on statistical methods, should also be avoided.
  14. ETL may alter the data Triage keeps the raw data intact Triage may require data reformatting but no data transformation
  15. It is a very messy world It is only getting more complicated
  16. Note: Cloudera Path comes at no cost with BDA. Berkeley DB (NoSQL DB) has a cost.
  17. The RDF triple store allows the storage and retrieval of any data structure and is well suited to storing the schema-last artifacts. Therefore, the triple store can store both the data and the schema structure. Ideally, a triple store graph would only contain a specific store’s data and schema structure.
  18. Analytics Architecture. It is not yet clear what the optimal architecture of an analytics system should be to deal with historic data and real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real-time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer, and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general, extensible, allows ad hoc queries, minimal maintenance, and debuggable. HBase: The concept of a table exists. Each table has its own key space. You can add and remove tables as easily as in an RDBMS. Uses binary keys. It is common to combine many different items together to form a key. Data consistency by design. Offers a convenience method to increment counters. Very suitable for data aggregation. MapReduce support is native. HBase is built on Hadoop. Data does not get transferred. Comparatively complicated, as it has many moving pieces such as Zookeeper, Hadoop and HBase itself. Comes with both a Thrift and a REST interface.
  19. We developed a simple but effective modelling system to map the triaged data. It was possible that the data was not in first normal form. The schema mapped to domains. ESRI will only require the address for geocoding.
  20. Informatica IIR was based on a product, SSANAME, which was developed in Canberra. Solr 5 now provides schema-less support.
  21. The thin client (Bongo) allowed an analyst to map the data for further processing. There is both a thin and a fat client.
  22. Unlike other triple store implementations, Sesame provides several reference implementations but allows third parties to provide additional alternatives. The Fusion Data Holding, or FDH, is a single repository created at the ACC to house its big data. There are a number of mandatory requirements that must be met for any design to succeed, which are: 1. Data must not be modified. 2. If the data was ordered, then the order could matter and must be retained. 3. The provenance of data is important and must be maintained and not lost through the data life-cycle. 4. The data must be able to be annotated, and the annotations must also not be lost. 5. Data must be extracted in bulk quickly and efficiently.
  23. Palantir was the delivery mechanism of choice for data dissemination.
  24. Success at last the
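
Notes 17 and 22 describe storing triaged tabular data and its schema together as triples, without modifying values and without losing order or provenance. The idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the Sesame-based implementation mentioned in the notes: plain Python tuples stand in for a triple store, and the predicate names (`prov:source`, `row:order`, `col:`, `schema:part`) and the function names are hypothetical. The Full-Name mapping follows the example on slide 20.

```python
import csv
import io


def triage_to_triples(csv_text, source):
    """Turn CSV text into (subject, predicate, object) triples, leaving every value untouched."""
    triples = []
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    for row_no, row in enumerate(rows, start=1):
        subject = f"{source}#row{row_no}"
        triples.append((subject, "prov:source", source))    # provenance is kept
        triples.append((subject, "row:order", str(row_no)))  # original order is kept
        for column, value in zip(header, row):
            triples.append((subject, f"col:{column}", value))  # raw value, unmodified
    return triples


# The schema is applied last and is itself stored as triples (hypothetical vocabulary).
SCHEMA = [
    ("schema:Full-Name", "schema:part", "col:First Name"),
    ("schema:Full-Name", "schema:part", "col:Middle Name"),
    ("schema:Full-Name", "schema:part", "col:Last Name"),
]


def resolve(triples, schema, subject, concept):
    """Build a schema-level value for one row at exploitation time."""
    parts = [o for s, p, o in schema if s == concept and p == "schema:part"]
    values = {p: o for s, p, o in triples if s == subject}
    return " ".join(values[p] for p in parts if values.get(p))


if __name__ == "__main__":
    text = "First Name,Middle Name,Last Name\nDavid,,JONES \n"
    store = triage_to_triples(text, "ratepayers.csv")
    print(resolve(store, SCHEMA, "ratepayers.csv#row1", "schema:Full-Name"))
```

Because the mapping lives beside the data instead of being applied on ingestion, a different model (for example a Full-Address concept) can be added later without touching the stored rows.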