(alexj@businessabstraction.com)
It is not only Hadoop…
• Big Data are the new types of data that let go of the limitations
we had to impose decades ago due to the state of hardware
and software back then
• The main challenge is therefore unlearning said limitations,
and learning to incorporate Big Data capabilities and agility
into [policy] work
• Traditional reporting and BI work with “known knowns”. Big
Data allows working with “known unknowns”, “unknown
knowns” and “unknown unknowns”.
• Several distinct types of technology fall under the “Big Data”
moniker, each with its own capabilities: Hadoop, NoSQL,
Semantic, Graph
© Copyright Business Abstraction Pty Ltd 2014-2015
• Consists of tables tightly packed with data, specific type per
row
• Tables identified and created in advance
• Tables populated from human input
• Tables used by filtering, grouping by rows, as well as
performing a limited number of joins, for reports, OLAP etc
• Text data are supposed to be read by people
• Data coming from all over the Internet
• Data from Internet of Things.
• Human circumstances
• XML structures
• Data come from someplace, designed by someone else
• Machine learning
• Clustering
• Graph algorithms
• Traditional for IT
• Fully defined data
• Traditional Database
• Master Data Model
• Data Warehouse

• New generation of ideas and technologies
• Presumes only part of information is known
• Internet
• Information across multiple enterprises
• Information extracted from texts
New generations of tools, often coming from Internet companies,
designed for “New Data”
• Hadoop Distributed File System (HDFS)
• NoSQL: Cassandra, MarkLogic, Couchbase, DynamoDB
• Column-store RDB
• Semantic DBs
• Graph DBs
• Map/Reduce of different flavours
• XQuery
• SPARQL
• Gremlin
• Write anything associated with a primary key (akin to a file
path)
• Distributed over commodity servers
• Highly concurrent write and read
• Everything is cheap – hardware, “design” etc
• However, small records have to be stored in SequenceFiles or
MapFiles
• Anything at scale – can store files in Petabytes
• Designed for Map/Reduce batch work, data lakes
• Anything interactive requires massive hardware
The term “NoSQL” means “not relational”, and as such covers a
lot of different models. Some of them suit the complexity of
general-purpose data storage. They are called “semi-structured”
because, although individual data items are structures, those
structures are not necessarily defined in advance.
NoSQL platforms combine Hadoop’s “store anything” capability
with indexing.
• Document databases store and index XML or JSON documents
(“trees”).
• A deep row store can be seen as a document database where
the depth of trees is limited to 2.
• Tables with named fields per row
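
To make the “tree” idea above concrete, the sketch below models a JSON document as nested Python containers and measures how deep it goes. The sample records and the `depth` helper are invented for illustration, and counting depth as the number of nested container levels is an assumption rather than a definition from any particular product.

```python
import json

def depth(node):
    """Number of nested container levels; a scalar value has depth 0."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0

# Hypothetical document: arbitrary nesting, no schema declared in advance.
document = json.loads("""
{"person": {"name": "Ann",
            "addresses": [{"city": "Canberra"}, {"city": "Sydney"}]}}
""")

# Hypothetical deep-row record: a row whose fields hold named sub-fields.
deep_row = {"contact": {"name": "Ann", "city": "Canberra"}}

print(depth(document))  # deeper tree, as a document database would accept
print(depth(deep_row))  # two levels, the limit described for a deep row store
```

Under this counting the document is four levels deep while the deep-row record stops at two.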
• “Interactive Hadoop”
• Low-granularity Hadoop
• Data Lake
• Operations DBs with complex data
• Data consolidation
• Dynamic Data Warehouse
• Operational Data Warehouse
• Data presumed “forests” of “trees” – connected data are
not handled as well
• A touch more expensive than Hadoop
Provide a traditional RDB interface in the new world
• Different internal structure
• Less suitable for OLTP
• Suitable for sparse data – empty fields take no space and
carry no read penalty
• Much faster for analytics, especially if only selected fields are
used
• Analytics when schema is known
• Cannot do schema-on-read
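
A minimal sketch of why a column layout helps with sparse data and selective analytics, assuming a toy in-memory model rather than any real column-store engine. The records, field names and the `to_columns` helper are all made up for illustration.

```python
# Row layout: every record carries every field, including empty ones.
rows = [
    {"id": 1, "income": 50_000, "region": "ACT", "note": None},
    {"id": 2, "income": 72_000, "region": "NSW", "note": None},
    {"id": 3, "income": 61_000, "region": "ACT", "note": "reviewed"},
]

def to_columns(records):
    """Pivot records into one {row_id: value} map per field, skipping empty values."""
    columns = {}
    for record in records:
        for field, value in record.items():
            if field != "id" and value is not None:
                columns.setdefault(field, {})[record["id"]] = value
    return columns

columns = to_columns(rows)

# Analytics over one field reads one column and ignores the rest.
average_income = sum(columns["income"].values()) / len(columns["income"])

print(average_income)
print(len(columns["note"]))  # only the single non-empty note is stored
```

The average touches only the `income` column, and the mostly empty `note` field costs one entry instead of three.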
Support the Resource Description Framework (RDF), originally
created for Semantic Web metadata. RDF stores information in
Subject-Predicate-Object “triples”, the most flexible
representation possible. Use SPARQL for queries.
• Graph patterns
• Metadata for Hadoop/NoSQL. Lack of internal schema
requires external metadata
• Do not scale as much
• Hype-contaminated: people who understand enterprise and
understand Semantic Tech are rare
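
The Subject-Predicate-Object idea can be sketched in a few lines of plain Python. This is not an RDF library and not SPARQL: the `match` helper, the identifiers and the facts are hypothetical, and `None` stands in for a query variable.

```python
# Each fact is one (subject, predicate, object) triple.
triples = {
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "sydney"),
    ("alice", "knows", "bob"),
}

def match(pattern, store):
    """Return triples matching a pattern; None acts as a wildcard, like a query variable."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )

# Roughly the graph pattern "?who worksFor acme".
employees = [s for s, _, _ in match((None, "worksFor", "acme"), triples)]
print(employees)

# A new kind of fact needs no schema change: just add another triple.
triples.add(("bob", "hasSkill", "sparql"))
```

Because every fact has the same three-part shape, new predicates can appear at any time without redesigning a table.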
Graph Databases see data as one huge graph. They are
optimised for navigating the edges of the graph. Use Gremlin.
• Implementing Graph Analytics
• Bespoke graph logic
• Backend for general apps (if BASE jumping is too boring)
• Not as scalable as NoSQL
• Lack the declarative data type, pattern and rule definitions of
Semantic DBs
• Depend on ability to build and maintain a graph
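
“Navigating the edges” can be illustrated with a breadth-first search over an adjacency list. This is a plain-Python sketch, not Gremlin; the graph, the vertex names and the `hops` function are assumptions made for the example.

```python
from collections import deque

# Hypothetical graph: each vertex maps to the vertices its outgoing edges reach.
edges = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": [],
}

def hops(graph, start, goal):
    """Breadth-first search: fewest edges from start to goal, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        vertex, distance = queue.popleft()
        if vertex == goal:
            return distance
        for neighbour in graph.get(vertex, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None

print(hops(edges, "ann", "eve"))
```

Each step follows edges directly from the current vertex, which is the access pattern a graph database is built to make cheap.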
Platform for massively parallel computation that shares the
workload effectively between commodity servers.
• MapReduce
• YARN
• Apache Spark
• Batch jobs over massive data
• On-demand queries where some lag is acceptable
• Implementations have powerful Analytics/Machine Learning
libraries
• Latency
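
The MapReduce pattern itself is small enough to sketch in one process. The word-count task, the input splits and the three function names below are illustrative assumptions; a real framework would run the map and reduce steps on many servers and handle the shuffle over the network.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input splits; on a cluster each could be mapped on a different server.
splits = ["big data is big", "data lakes hold data"]

pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])
```

Since each split is mapped independently, adding servers lets more splits be processed at the same time, which is where the batch throughput comes from.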
• Ensure datasets are identifiable
• Capture metadata
• Ensure your data are not lost
• Profile data across field names, structures etc
• Locate data as needed
• As you learn more about data, build up your metadata
• Hadoop
• NoSQL
• A server in $1,000-$10,000 range
• 0.5TB – 25TB per server
• A lot of them if needed
• Doubling the number of servers reduces the time to execute
the task by a factor of 2.
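
The arithmetic behind these figures can be sketched as follows. The 1 PB dataset and the 80-hour single-server job are hypothetical, and the model assumes perfectly linear scaling with no replication; real clusters replicate data and pay coordination overhead, so both results are optimistic.

```python
import math

def servers_needed(dataset_tb, tb_per_server):
    """Servers required to hold the dataset once, ignoring replication."""
    return math.ceil(dataset_tb / tb_per_server)

def ideal_runtime(single_server_hours, servers):
    """Runtime under perfectly linear scaling: doubling servers halves the time."""
    return single_server_hours / servers

# Hypothetical workload: 1 PB (1,000 TB) on servers at the top of the 0.5-25 TB range.
print(servers_needed(1000, 25))
print(ideal_runtime(80, 40), ideal_runtime(80, 80))
```

Going from 40 to 80 servers halves the ideal runtime, which is the factor of 2 stated above.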
Perhaps more complex than learning
• There are a lot of data you do not know about that are
available and can be used
• For many types of objects, it is natural to have uncommon
attributes
• Data storage is cheap. It doesn’t cost much to store everything
remotely related
• No massive pre-work.
• Ask everything
• Traditional reporting, BI
• Predictive analytics
• Data consolidation, Semantic Integration, Object-based
Intelligence
• Clustering
• Straightforward operation, no design upfront
• Can take immensely complex metadata, like UML & BPMN
models
• Apply OWL for classification
• SWRL builds complex linkages
• Refer to Classes defined by lower-level Ontologies rather than
data
• Words to be converted to tags (URLs)
• Some words have multiple meanings
• Ontology provides possible tags for nouns
• Software tries to resolve expected predicates
• The tag that can find necessary relations (predicates) wins
• Use ontology to restrict search
• Much more flexible than “foreign key”
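
One way to picture the disambiguation steps above is a toy resolver in which the candidate tag that supports the expected predicates wins. The ontology contents, the example.org URLs and the scoring rule are all invented for this sketch; actual semantic tagging software is likely to be considerably more involved.

```python
# Hypothetical ontology: each word maps to candidate tags, and each tag lists
# the predicates (relations) it can take part in. URLs are made up.
CANDIDATES = {
    "jaguar": ["http://example.org/animal/Jaguar", "http://example.org/car/Jaguar"],
}
PREDICATES = {
    "http://example.org/animal/Jaguar": {"eats", "livesIn"},
    "http://example.org/car/Jaguar": {"hasEngine", "madeBy"},
}

def resolve(word, expected_predicates):
    """Pick the candidate tag supporting the most expected predicates, or None."""
    best_tag, best_score = None, 0
    for tag in CANDIDATES.get(word, []):
        score = len(PREDICATES[tag] & expected_predicates)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

# "The jaguar was made by ..." suggests the predicate madeBy.
print(resolve("jaguar", {"madeBy"}))
```

A word whose candidates support none of the expected predicates stays unresolved, which is safer than guessing a tag.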
• Information about data
• Traditional metadata was stored in form of data schema
• With schema-less storage, metadata should be stored
separately
• Incremental discovery process requires Open World
Assumption – you don’t know what other data are there.
• Reasoning to handle complexity
• Relationships as first-class citizens and the basis for
classification
• Work in progress – not there yet
• Different data paradigms mandate different views
• SQL-like view of Big Data (Apache Pig etc)
• Excel import
• Analytic visualisation frontends
• 30+ JavaScript libraries
• Presume development
• Mahout & other libraries
• Writing code in Scala, Java, Python, Groovy
• Better picture of the current state
• “What if” prediction
• Researching impact
• Increasing the number of categories, by several orders of
magnitude if necessary
• Common, meaningful view of individual, organisation etc
• Prevention of undesirable effects on insights, complex events
and prediction
• Description Logic, while using First-Order Predicate Logic
terminology
• Reduced for practical purposes
• Is not necessary to be productive
• Can be applied to anything
• Class can be derived depending on values
• A State is a Class
• New triples can be derived from existing ones
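
The last three points can be sketched with two hand-written rules in plain Python. This is not OWL or SWRL and does not use a reasoner; the facts, the class names `Minor` and `Adult`, and the `infer` function are assumptions chosen only to show a class derived from a value and a triple derived from another triple.

```python
# Hypothetical facts as (subject, predicate, object) triples.
facts = {
    ("ann", "age", 17),
    ("bob", "age", 42),
    ("ann", "parent", "bob"),
}

def infer(triples):
    """Apply two illustrative rules once and return only the newly derived triples."""
    derived = set()
    for s, p, o in triples:
        # Rule 1: class membership derived from a value (a state is a class).
        if p == "age":
            derived.add((s, "type", "Minor" if o < 18 else "Adult"))
        # Rule 2: a new relationship derived from an existing one.
        if p == "parent":
            derived.add((o, "child", s))
    return derived - triples

new_triples = infer(facts)
print(sorted(new_triples))
```

If Ann's age changes, rerunning the rules moves her to a different class without anyone editing her classification by hand.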
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 

Understanding Big Data for policy professionals

  • 2.
      • Big Data are the new types of data that let go of the limitations we had to impose decades ago due to the state of hardware and software back then
      • The main challenge is therefore unlearning said limitations, and learning to incorporate Big Data capabilities and agility into [policy] work
      • Traditional reporting and BI work with “known knowns”. Big Data allows working with “known unknowns”, “unknown knowns” and “unknown unknowns”.
      • Several distinct types of technology fall under the “Big Data” moniker, each with its own capabilities: Hadoop, NoSQL, Semantic, Graph
      © Copyright Business Abstraction Pty Ltd 2014-2015
  • 3.
      • Consists of tables tightly packed with data, specific type per row
      • Tables identified and created in advance
      • Tables populated from human input
      • Tables used by filtering, grouping by rows, as well as performing a limited number of joins, for reports, OLAP, etc.
      • Text data are supposed to be read by people
  • 4.
      • Data coming from all over the Internet
      • Data from the Internet of Things
      • Human circumstances
      • XML structures
      • Data come from someplace, designed by someone else
      • Machine learning
      • Clustering
      • Graph algorithms
  • 5.
      • Traditional for IT
      • Fully defined data
      • Traditional Database
      • Master Data Model
      • Data Warehouse
      • New generation of ideas and technologies
      • Presumes only part of information is known
      • Internet
      • Information across multiple enterprises
      • Information extracted from texts
  • 6. New generations of tools, often coming from Internet companies, designed for “New Data”
      • Hadoop File System
      • NoSQL: Cassandra, MarkLogic, Couchbase, DynamoDB
      • Column-store RDB
      • Semantic DBs
      • Graph DBs
      • Map/Reduce of different flavours
      • XQuery
      • SPARQL
      • Gremlin
  • 7.
      • Write anything associated with a primary key (akin to a file path)
      • Distributed over commodity servers
      • Highly concurrent write and read
      • Everything is cheap – hardware, “design” etc.
      • However, small records have to be stored in Sequence Files or Map Files
      • Anything at scale – can store files in petabytes
      • Designed for Map/Reduce batch work, data lakes
      • Anything interactive requires massive hardware
  • 8. The term “NoSQL” means “not relational”, and as such covers a lot of different models. Some of them are suited to the complexity of generic data storage. They are called “semi-structured” because, although individual data items are structures, the structures are not necessarily defined in advance. NoSQL platforms combine Hadoop’s “store anything” capability with indexing and
      • Store and index XML or JSON documents (“trees”)
      • A deep row store can be seen as a document database where the depth of trees is limited to 2
      • Tables with named fields per row
  • 9.
      • “Interactive Hadoop”
      • Low-granularity Hadoop
      • Data Lake
      • Operations DBs with complex data
      • Data consolidation
      • Dynamic Data Warehouse
      • Operational Data Warehouse
      • Data presumed to be “forests” of “trees” – connected data are not handled as well
      • A touch more expensive than Hadoop
  • 10. Provide traditional RDB interface in the new world
      • Different internal structure
      • Less suitable for OLTP
      • Suitable for sparse data – empty fields don’t take space or penalise reads
      • Much faster for analytics, especially if only selected fields are used
      • Analytics when schema is known
      • Cannot do schema-on-read
  • 11. Support the Resource Description Framework (RDF), originally created for Semantic Web metadata. RDF stores information in Subject-Predicate-Object “triples”, the most flexible representation possible. Use SPARQL for queries.
      • Graph patterns
      • Metadata for Hadoop/NoSQL. Lack of internal schema requires external metadata
      • Do not scale as much
      • Hype-contaminated: people who understand both the enterprise and Semantic Tech are rare
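The Subject-Predicate-Object model on this slide can be illustrated with a few lines of plain Python. This is only a sketch of the idea of matching a graph pattern against triples, not SPARQL and not any particular semantic database; the sample data and the `match` helper are invented for illustration.

```python
# Hypothetical sample triples: (subject, predicate, object).
TRIPLES = [
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "sydney"),
]


def match(triples, s=None, p=None, o=None):
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for subject, predicate, obj in triples:
        if s is not None and subject != s:
            continue
        if p is not None and predicate != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subject, predicate, obj)


# "Who works for acme?" expressed as the pattern (?, worksFor, acme).
employees = sorted(t[0] for t in match(TRIPLES, p="worksFor", o="acme"))
print(employees)
```

Because every fact has the same three-part shape, a new kind of fact needs no schema change: it is simply one more triple in the list.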
  • 12. Graph Databases see data as one huge graph. They are optimised for navigating the edges of the graph. Use Gremlin.
      • Implementing Graph Analytics
      • Bespoke graph logic
      • Backend for general apps (if BASE jumping is too boring)
      • Not as scalable as NoSQL
      • Lack declarative data type, patterns & rules definitions of Semantic DBs
      • Depend on ability to build and maintain a graph
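"Navigating the edges of the graph" can be sketched as a breadth-first walk over an adjacency list. The code below is plain Python rather than Gremlin, and the `EDGES` data and the `within_hops` function are hypothetical names chosen for this illustration.

```python
from collections import deque

# Hypothetical directed graph: node -> list of nodes it points to.
EDGES = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
}


def within_hops(graph, start, max_hops):
    """Return {node: hop_count} for nodes reachable from start in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the start node is not its own neighbour
    return distance


print(within_hops(EDGES, "alice", 2))
```

A graph database keeps such edges directly navigable, so a query of this shape does not have to be rebuilt from joins.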
  • 13. Platform for massively parallel computations, enables effective sharing of workload between commodity servers.
      • MapReduce
      • YARN
      • Apache Spark
      • Batch jobs over massive data
      • On-demand queries where some lag is acceptable
      • Implementations have powerful Analytics/Machine Learning libraries
      • Latency
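The Map/Reduce pattern named on this slide can be shown in miniature with a word count. This is a single-process sketch in plain Python, assuming nothing about Hadoop or Spark APIs; on a real cluster the map and reduce steps would run in parallel on many servers. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data", "Big graph data"])
print(counts)
```

Each document is mapped independently and each key is reduced independently, which is what lets the work be shared between commodity servers.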
  • 14.
      • Ensure datasets are identifiable
      • Capture metadata
      • Ensure your data are not lost
      • Profile data across field names, structures etc.
      • Locate data as needed
      • As you learn more about data, build up your metadata
      • Hadoop
      • NoSQL
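Profiling data "across field names" could start as simply as counting how often each field appears in a set of semi-structured records. The sketch below assumes the records have already been parsed into Python dictionaries; `profile_fields` and the sample records are hypothetical.

```python
from collections import Counter


def profile_fields(records):
    """Return {field_name: share of records that contain the field}."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    total = len(records)
    return {field: seen / total for field, seen in counts.items()}


# Hypothetical records with no schema agreed in advance.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "email": "x"},
]
profile = profile_fields(records)
print(profile)
```

A profile like this is one small piece of metadata that can be captured early and extended as more is learned about the data.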
  • 15.
      • A server in the $1,000-$10,000 range
      • 0.5TB – 25TB per server
      • A lot of them if needed
      • Doubling the number of servers reduces the time to execute the task by a factor of 2
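The scaling claim on this slide amounts to run time being inversely proportional to the number of servers. The function below works that arithmetic through under the slide's idealised assumption of perfectly linear scaling; the function name and the sample figures are invented for illustration.

```python
def estimated_hours(base_hours, base_servers, new_servers):
    """Estimate run time on new_servers, assuming perfectly linear scaling."""
    if base_servers <= 0 or new_servers <= 0:
        raise ValueError("server counts must be positive")
    return base_hours * base_servers / new_servers


# A hypothetical job that takes 8 hours on 10 servers.
print(estimated_hours(8.0, 10, 20))
print(estimated_hours(8.0, 10, 40))
```

Under that assumption each doubling halves the estimate; real jobs may fall short of this when coordination overhead or uneven data distribution gets in the way.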
  • 16. Perhaps more complex than learning
      • There are a lot of data you do not know about which are available and can be used
      • For many types of objects, it is natural to have uncommon attributes
      • Data storage is cheap. It doesn’t cost much to store everything remotely related
      • No massive pre-work
      • Ask everything
  • 17.
      • Traditional reporting, BI
      • Predictive analytics
      • Data consolidation, Semantic Integration, Object-based Intelligence
      • Clustering
  • 18.
      • Straightforward operation, no design upfront
      • Can take immensely complex metadata, like UML & BPMN models
      • Apply OWL for classification
      • SWRL builds complex linkages
      • Refer to Classes defined by lower-level Ontologies rather than data
  • 19.
      • Words to be converted to tags (URLs)
      • Some words have multiple meanings
      • Ontology provides possible tags for nouns
      • Software tries to resolve expected predicates
      • The tag that can find necessary relations (predicates) wins
      • Use ontology to restrict search
      • Much more flexible than “foreign key”
  • 20.
      • Information about data
      • Traditional metadata was stored in the form of a data schema
      • With schema-less storage, metadata should be stored separately
      • Incremental discovery process requires Open World Assumption – you don’t know what other data are there
      • Reasoning to handle complexity
      • Relationships as first-class citizens and the basis for classification
  • 21.
      • Work in progress – not there yet
      • Different data paradigms mandate different views
      • SQL view of Big Data (Apache Pig etc.)
      • Excel import
      • Analytic visualisation frontends
      • 30+ JavaScript libraries
      • Presume development
      • Mahout & other libraries
      • Writing code in Scala, Java, Python, Groovy
  • 22.
      • Better picture of the current state
      • What-if prediction
      • Researching impact
      • Increasing the number of categories, by several orders of magnitude if necessary
      • Common, meaningful view of individual, organisation etc.
      • Prevention of undesirable effects on insights, complex events and prediction
  • 24.
      • Description Logic, while using First-Order Predicate Logic terminology
      • Reduced for practical purposes
      • Is not necessary to be productive
      • Can be applied to anything
      • Class can be derived depending on values
      • A State is a Class
      • New triples can be derived from existing
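Two points from this slide, deriving a class from a value and deriving new triples from existing ones, can be sketched with hand-written rules. The code below is plain Python and is not OWL, SWRL or a Description Logic reasoner; the rules, predicates and sample facts are all invented for illustration.

```python
def derive(triples):
    """Return the input facts plus facts inferred by two hand-written rules."""
    base = list(triples)
    facts = set(base)
    for subject, predicate, obj in base:
        # Rule 1 (class from value): anyone with age below 18 is a Minor.
        if predicate == "age" and int(obj) < 18:
            facts.add((subject, "type", "Minor"))
        # Rule 2 (triple from triples): an employee is based where the employer is located.
        if predicate == "worksFor":
            for s2, p2, o2 in base:
                if s2 == obj and p2 == "locatedIn":
                    facts.add((subject, "basedIn", o2))
    return facts


# Hypothetical starting facts.
derived = derive([
    ("tom", "age", "12"),
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "sydney"),
])
print(sorted(derived))
```

In a semantic database such rules would be declared in the ontology instead of being coded by hand, which is the difference the slide points at.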

Editor's Notes

  1. For 3 decades, we were treating you badly. Don’t ask too many questions. Don’t ask complex questions. We’ve heard you, wait for your answer, patience. Tell us in advance all the kinds of questions you are going to ask. If you fail to do so, we will be calling you names behind your back. We know other people have other data, but that is what we have, and that is what we will live with.
  2. Semantic – meaning. Understanding the meaning of data. Understanding means reasoning, creating new meanings from old ones. Graph patterns.
  3. Variable? Which concept is it related to? What classification has been used? What encoding?