● Big Data views
○ Scientific Method
○ Data Characteristics
○ New Technology
○ Business Opportunities
● Opportunities for BI professionals
The famous McKinsey report: "Big data: The next frontier for innovation, competition, and productivity"
Big data became a trending topic largely because of this McKinsey report
Now the term is often associated with Hadoop
Wikipedia Big Data
Big data usually includes data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage, and process the data within a
tolerable elapsed time.
Big data sizes are a constantly moving target, as of 2012 ranging from a few
dozen terabytes to many petabytes of data in a single data set.
The target moves due to constant improvement in traditional DBMS technology
as well as new databases like NoSQL and their ability to handle larger amounts of data.
With this difficulty, new platforms of "big data" tools are being developed to
handle various aspects of large quantities of data.
Focus on volume… instead of other V’s
The Fourth Paradigm: Data-Intensive Scientific Discovery
Increasingly, scientific breakthroughs will
be powered by advanced computing
capabilities that help researchers
manipulate and explore massive datasets.
Implicit in the idea of a fourth paradigm is
the ability, and the need, to share data. In
sciences like physics and astronomy, the
instruments are so expensive that data
must be shared
Data analysis is the new microscope
Human Genome Project, Large Hadron Collider
A thousand years ago: science was empirical, describing natural phenomena
Last few hundred years: a theoretical branch using models and generalizations
Last few decades: a computational branch simulating complex phenomena
Today: data exploration (eScience), unifying theory, experiment, and simulation
○ Data captured by instruments or generated by simulators
○ Processed by software
○ Information/knowledge stored in computers
○ Scientist analyzes databases/files using data management and statistics
On Sunday, January 28, 2007, during a short solo sailing trip to the Farallon Islands near San
Francisco to scatter his mother's ashes, Gray and his 40-foot yacht, Tenacious, were reported
missing by his wife, Donna Carnes. The Coast Guard searched for four days using a C-130
plane, helicopters, and patrol boats but found no sign of the vessel.
Gray's boat was equipped with an automatically deployable EPIRB (Emergency Position-Indicating Radio Beacon), which should have deployed and begun transmitting the instant his
vessel sank. The area around the Farallon Islands where Gray was sailing is well north of the
East-West ship channel used by freighters entering and leaving San Francisco Bay. The
weather was clear that day and no ships reported striking his boat, nor were any distress radio calls received.
On February 1, 2007, the DigitalGlobe satellite did a scan of the area, generating thousands of
images. The images were posted to Amazon Mechanical Turk in order to distribute the work
of searching through them, in hopes of spotting his boat.
In the immediate aftermath of the disappearance, many theories were put forward on how the boat could have disappeared.
On February 16, 2007, the family and Friends of Jim Gray Group suspended their search,
but continue to follow any important leads. The family ended its underwater search May 31,
2007. Despite much effort and use of high-tech equipment above and below water, searches
did not reveal any new clues.
While at Berkeley, Gray and his first wife Loretta had a daughter; the couple later divorced.
He is survived by his wife, Donna Carnes, his daughter, three grandchildren, and his sister.
The University of California, Berkeley and Gray's family hosted a tribute to him on May 31,
2008. The conference included sessions delivered by Richard Rashid and David Vaskevitch.
Microsoft's WorldWide Telescope software is dedicated to Gray. In 2008, Microsoft opened
a research center in Madison, Wisconsin, named after Jim Gray.
Having been missing for five years, as of May 16, 2012, Gray is legally presumed to have died.
Jim Gray Award
Each year, Microsoft Research presents the Jim Gray eScience Award to a researcher who
has made an outstanding contribution to the field of data-intensive computing. Award
recipients are selected for their ground-breaking, fundamental contributions to the field of
eScience. Previous award winners include Alex Szalay (2007), Carole Goble (2008), Jeff
Dozier (2009), Phil Bourne (2010), Mark Abbott (2011) and Antony John Williams (2012).
Transaction Processing: Concepts and Techniques (with Andreas Reuter). Morgan Kaufmann, 1993.
The Benchmark Handbook: For Database and Transaction Processing Systems. Morgan Kaufmann, 1991. ISBN 978-1-55860-159-8.
This is a world where massive amounts of data and applied mathematics replace every
other tool that might be brought to bear. Out with every theory of human behavior, from
linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why
people do what they do? The point is they do it, and we can track and measure it with
unprecedented fidelity. With enough data, the numbers speak for themselves.
There is now a better way. Petabytes allow us to say: "Correlation is enough." We can
stop looking for models. We can analyze the data without hypotheses about what it might
show. We can throw the numbers into the biggest computing clusters
the world has ever seen and let statistical algorithms find patterns
where science cannot.
The end of theory:
Cukier and Mayer-Schönberger
Shift 1: End of samples
Shift 2: End of exactitude
Shift 3: End of causality
patterns & correlations
if you know that your customers are going to buy more products
by analyzing a data set or correlation, then the “why” doesn’t matter
— you should try to exploit that.
The technical equivalent in big data is the ability to survey a whole population instead
of just sampling random portions of it.
The second shift holds that "with less error from sampling we can accept more measurement error". According to
the authors, science is obsessed with sampling and measurement error as a
consequence of coping in a 'small data' world.
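The first shift, surveying the whole population instead of a sample, can be illustrated with a small simulation. This is purely illustrative: the "purchase amounts" scenario and all numbers are invented, but it shows how sampling error shrinks as the sample approaches the population.

```python
import random
import statistics

random.seed(42)

# Hypothetical "population": 100,000 purchase amounts.
population = [random.gauss(50.0, 15.0) for _ in range(100_000)]
true_mean = statistics.mean(population)

def sampling_error(sample_size: int) -> float:
    """Average absolute error of a random sample's mean vs. the
    population mean, over 200 repetitions to smooth out luck."""
    errors = []
    for _ in range(200):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

small = sampling_error(10)     # tiny sample: large sampling error
large = sampling_error(1_000)  # big sample: sampling error shrinks
print(f"avg error with n=10:   {small:.2f}")
print(f"avg error with n=1000: {large:.2f}")
```

With "n = everything" the sampling error vanishes entirely, which is the authors' argument for tolerating more measurement error.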
The third and most radical shift implies "we won't have to be fixated on causality [...]
the idea of understanding the reasons behind all that happens." This is a straw-man argument.
“We're not that much smarter than we
used to be, even though we have much
more information - and that means the
real skill now is learning how to pick out
the useful information from all this noise.”
“I came to realize that prediction in the era
of Big Data was not going very well.”
“If the quantity of information is increasing
[exponentially]… Most of it is just noise.”
“… numbers have no way of speaking for
themselves. We speak for them.”
Nate Silver has lived a preposterously interesting life. In 2002, while toiling away as a
lowly consultant for the accounting firm KPMG, he hatched a revolutionary method for
predicting the performance of baseball players, which the Web site Baseball
Prospectus subsequently acquired. The following year, he took up poker in his spare
time and quit his job after winning $15,000 in six months. (His annual poker winnings
soon ran into the six figures.)
Big Data is bullshit
This is the tragedy of big data: the more variables, the more correlations that can show
significance. Falsity also grows faster than information; it is nonlinear (convex) with
respect to data.
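Taleb's claim, that more variables mechanically produce more correlations that look significant, can be demonstrated over pure noise. The sketch below is illustrative only; the sample size and threshold are arbitrary choices.

```python
import random
import statistics

random.seed(0)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spurious_count(n_vars: int, n_obs: int = 20, threshold: float = 0.5) -> int:
    """Count variable pairs whose |correlation| exceeds the threshold,
    even though every variable is independent random noise."""
    data = [[random.random() for _ in range(n_obs)] for _ in range(n_vars)]
    count = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(pearson(data[i], data[j])) > threshold:
                count += 1
    return count

few = spurious_count(5)    # 10 pairs to test
many = spurious_count(50)  # 1,225 pairs to test: far more false positives
print(few, many)
```

The number of pairs grows quadratically with the number of variables, so the count of "significant" correlations in noise grows with it: the needle comes in an ever larger haystack.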
It is an outlier, as it lies outside the realm of regular expectations, because nothing in the past
can convincingly point to its possibility.
It carries an extreme 'impact'.
In spite of its outlier status, human nature makes us concoct explanations for its occurrence after
the fact, making it explainable and predictable.
A small number of Black Swans explains almost everything in our world, from the success of ideas and
religions, to the dynamics of historical events, to elements of our own personal lives.
I am not saying here that there is no information in big data. There is plenty of information. The
problem — the central issue — is that the needle comes in an increasingly larger haystack.
The discovery of the Higgs particle was a disappointment for some physicists, because now they
know what they don't know: there are no big things left to discover.
The ludic fallacy is a term coined by Nassim Nicholas Taleb in his 2007 book The Black
Swan. "Ludic" is from the Latin ludus, meaning "play, game, sport, pastime." It is
summarized as "the misuse of games to model real-life situations." Taleb explains the fallacy
as "basing studies of chance on the narrow world of games and dice."
It is a central argument in the book and a rebuttal of the predictive mathematical models used
to predict the future – as well as an attack on the idea of applying naïve and simplified
statistical models in complex domains. According to Taleb, statistics works only in some
domains like casinos in which the odds are visible and defined. Taleb's argument centers on
the idea that predictive models are based on platonified forms, gravitating towards
mathematical purity and failing to take some key ideas into account:
It is impossible to be in possession of all the information.
Very small unknown variations in the data could have a huge impact. Taleb does
differentiate his idea from that of mathematical notions in chaos theory, e.g. the
butterfly effect.
Theories/models based on empirical data are flawed, as they cannot predict
events that have never happened before but have tremendous impact, e.g. the
9/11 terrorist attacks or the invention of the automobile.
Discover what you (don't) know you (don't) know
Data integration is already 20+ years old
Just another source
We do not have much data
Small or big data: it has to be managed
Big data = business analytics
One-off projects (data is too varied)
We know what data is all about. Nobody has to tell us what we can do with data.
Gartner’s definition (2001)
Big Data is high-volume, high-velocity, and/or high-variety information assets that require
new forms of processing to enable enhanced decision making, insight discovery and
process optimization.
Volume: relative size of data sources
Velocity: speed at which data refresh is handled
Variety: handling various data formats
(plus: Validity, Veracity (accuracy, correctness, applicability), Value, and Visibility)
“Information was a pond
and has become a river”
Fantastic, entertaining speaker at the SAS forum; good presentation. Filtering is becoming
crucial: to keep data actionable, you must react instantly. Fishing in a lake versus fishing
in a river, with so much water streaming past so quickly.
The true godfather of Data
Human-sourced information
○ is now largely digitized and
electronically stored everywhere,
from tweets to movies
Process-mediated data
○ includes transactions,
reference tables and
relationships, as well as the
metadata that sets its context, all
in a highly structured form
Machine-generated data
○ from simple sensor records to
complex computer logs
Impact on the DWH
The central core business data pillar
is the consistent, quality-assured
data found in EDW and MDM
Deep analytic information requires
highly flexible, large-scale
processing such as statistical
analysis and text mining
Fast analytic data requires such
high-speed analytic processing that
it must be done on data in-flight,
before it is stored
Specialty analytic data, using
specialized processing such as
NoSQL, XML, graph and other
databases and data stores
Inmon now focuses on deep analytic information with his text mining
Other BIG data related trends
A NoSQL database provides a mechanism for storage and retrieval of data that employs less
constrained consistency models than traditional relational databases. NoSQL systems are also
referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used.
Document: MongoDB, Couchbase
Key-value: Dynamo, Riak
Graph: Neo4j, AllegroGraph
NoSQL: MongoDB
How and Why Leading Investment Organizations are Migrating to MongoDB
Real World MongoDB: Use Cases from Financial Services
How Financial Firms Create Single Customer Views Using MongoDB
How Banks Use MongoDB to Manage Risk
How Banks Manage Reference Data with MongoDB
How Banks Use MongoDB as a Tick Database
Position and Trade Management
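Document stores such as MongoDB hold schemaless, JSON-like documents and answer field-equality queries. The toy in-memory store below mimics that query style in plain Python; it is illustrative only, not the MongoDB API, and the tickers and fields are invented.

```python
# A minimal in-memory "document store": documents are schemaless dicts,
# and find() matches on field equality, loosely mimicking the query
# style of document databases like MongoDB.
class DocumentStore:
    def __init__(self):
        self.docs = []

    def insert(self, doc: dict) -> None:
        self.docs.append(doc)

    def find(self, query: dict) -> list:
        """Return all documents whose fields equal every key/value in query."""
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

store = DocumentStore()
store.insert({"ticker": "AAPL", "price": 180.5, "venue": "NASDAQ"})
store.insert({"ticker": "ASML", "price": 640.0})  # no fixed schema required
store.insert({"ticker": "AAPL", "price": 181.0, "venue": "NASDAQ"})

aapl = store.find({"ticker": "AAPL"})
print(len(aapl))  # 2 matching documents
```

The freedom to insert documents with differing fields is exactly what the "less constrained" models above refer to, and is why tick data and heterogeneous reference data fit document stores well.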
Nodes represent entities.
Properties are pertinent information that relates to nodes.
Edges are the lines that connect nodes to nodes, or nodes to
properties, and they represent the relationship between the two.
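The node/property/edge model above can be sketched in a few lines of Python. All names, labels, and properties here are invented for illustration; a real graph database such as Neo4j adds indexing, persistence, and a query language on top of the same structure.

```python
# Property-graph sketch: nodes carry properties, edges connect nodes
# and are labeled with the relationship they represent.
nodes = {
    "alice": {"type": "Person", "age": 34},
    "bob":   {"type": "Person", "age": 41},
    "acme":  {"type": "Company"},
}
edges = [
    ("alice", "WORKS_AT", "acme"),
    ("bob",   "WORKS_AT", "acme"),
    ("alice", "KNOWS",    "bob"),
]

def neighbors(node: str, relation: str) -> list:
    """All nodes reachable from `node` over edges with the given label."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

print(neighbors("alice", "WORKS_AT"))  # ['acme']
```

Traversals like `neighbors` are the basic operation graph databases optimize: relationships are first-class data rather than joins computed at query time.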
Ooh/aah strategy: first be amazed then understand
Local intelligence: ORTEC/TSS
ORTEC Team Support Systems (ORTEC TSS)
develops decision-support and information ICT
systems to analyze sports performances.
These software systems are employed before,
during and after sport matches. During a match,
they are used to measure teams' and players' performances.
Following top athletes and talents by their clubs,
teams, sponsors, unions and the public has
been brought to a whole new dimension
because of these systems.
Elastic cloud: Amazon Redshift
$999 per TB per year
The Hadoop ecosystem isn't stable; a lot of configurations are possible.
Hadoop is complex and requires Java expertise.
Apache Hadoop : Open source Hadoop framework in Java.
Consists of Hadoop Common Package (filesystem and OS
abstractions), a MapReduce engine (MapReduce or YARN),
and Hadoop Distributed File System (HDFS)
Apache Mahout: Machine learning algorithms for
collaborative filtering, clustering, and classification,
built on top of Hadoop.
Apache Hive: Data warehouse infrastructure for Hadoop.
Provides data summarization, query, and analysis using a
SQL-like language called HiveQL. By default stores its
metadata in an embedded Apache Derby database.
Apache Pig: Platform for creating MapReduce programs
using a high-level “Pig Latin” language. Makes MapReduce
programming similar to SQL. Can be extended by user
defined functions written in Java, Python, etc
Apache Avro: Data serialization system. Avro IDL is the
interface description language syntax for Avro.
Apache HBase: Non-relational DBMS part of the Hadoop
project. Designed for large quantities of sparse data (like
BigTable). Provides a Java API for map reduce jobs to
access the data. Used by Facebook.
Apache ZooKeeper : Distributed configuration service,
synchronization service, and naming registry for large
distributed systems like Hadoop.
Apache Cassandra: Distributed database management
system. Highly scalable.
Apache Ambari: A web-based tool for provisioning, managing
and monitoring Apache Hadoop clusters
Apache Chukwa: A data collection system for monitoring large distributed systems
Apache Sqoop: Tool for transferring bulk data between
structured databases and Hadoop
Apache Oozie: A workflow scheduler system to manage
Apache Hadoop jobs
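The MapReduce programming model at the heart of Hadoop can be illustrated with the canonical word-count example. This single-process sketch only shows the map, shuffle, and reduce phases; Hadoop's value lies in running these phases distributed over HDFS blocks across a cluster.

```python
from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values of each group."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data beats opinion"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```

In Hadoop, each mapper processes one input split, the framework performs the shuffle over the network, and reducers write results back to HDFS; Pig and Hive generate jobs with this same shape from higher-level languages.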
"For big data, 2013 is the year of experimentation and early deployment," said Frank
Buytendijk, research vice president at the research firm. "Adoption is still at the early
stages with less than 8 percent of all respondents indicating their organization has
deployed big data solutions. [Across the board], 20 percent are piloting and
experimenting, 18 percent are developing a strategy, 19 percent are knowledge
gathering, while the remainder has no plans or don't know."
Has "Big Data" significantly changed
Data Science principles and practice?
KDnuggets poll (Oct 29, 2013)
Analytics is BIG
Analytics is hotter; the green line is Google Analytics, so the blue line should be corrected for that.
Platform for predictive analytics competitions
The business hands over part of the data and keeps the rest of the data set back
Contenders build models based on the available data
Contenders predict the values of the held-back data
The best prediction wins the competition
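The competition mechanics above amount to held-out evaluation: contenders never see the test targets and are scored only on their predictions for them. A minimal sketch follows; the linear data, noise level, trivial model, and RMSE metric are all invented for illustration.

```python
import random

random.seed(1)

# Synthetic data: y = 2x + noise. The organizer holds back 20% as the
# hidden test set; contenders train only on the rest.
data = [(x, 2.0 * x + random.gauss(0, 0.5)) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]

def rmse(preds, actuals):
    """Root mean squared error, a typical competition scoring metric."""
    return (sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(actuals)) ** 0.5

# A contender's (deliberately simple) model: least-squares slope
# through the origin, fitted on the training data only.
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

# The organizer scores predictions against the hidden targets.
preds = [slope * x for x, _ in test]
score = rmse(preds, [y for _, y in test])
print(f"held-out RMSE: {score:.3f}")
```

Because the test targets stay hidden, a low score can only come from a model that generalizes, which is exactly why the platform works as a benchmark.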
A global hydrological model will provide the international community with the best
possible estimates of the state of water resources in the world.
Assimilation of remotely sensed and in situ data will be a major mathematical
and computational challenge.
A successful implementation of the project will lead to a community model for
hydrologists across the globe.
“Perhaps the most important cultural trend
today: The explosion of data about every
aspect of our world and the rise of applied math
gurus who know how to use it.”
Since Silk first came out of stealth mode in 2011, there have been 300,000 interactive
pages created on its cloud-based, web data-crunching platform designed for non-technical
"knowledge workers." Taking less easy-to-read data sets and making them more digestible,
results have ranged from the Guardian newspaper in the UK creating graphics of which
countries have the most asylum seekers, through to charting what products Google has
killed, and dads mapping out the best playgrounds for their kids in Amsterdam (where Silk
also happens to have been founded). It's been a popular, and free, tool, with pages
created by some 16,000 people, growing by 20 percent each month.
Now, Silk is moving on to its next phase: its first paid product, Silk for Teams, aimed
at groups of enterprise users who want to use the platform to produce cleaner internal
data sets, and eventually to create data visualizations that work with paywalls.
“Our research suggests that seven sectors
alone could generate more than $3 trillion a
year in additional value as a result of open data.”
Open data: Unlocking innovation and performance with liquid information
A new McKinsey report says that open data can help create $3 trillion a year of
economic value across seven sectors. In a related podcast, the McKinsey Global
Institute's Michael Chui discusses the economic potential of open data.
Combining all the sources of this and the previous 3 slides and finding correlations is
the essence of (big) data analytics.
Example: combining SunPower data with Sleep Cycle, fitness, and diet data
“The ability to take data — to be able to
understand it, to process it, to extract value
from it, to visualize it, to communicate it — that's
going to be a hugely important skill in the next decades.”
“The illiterate of the 21st century will not be
those who cannot read and write, but those
who cannot learn, unlearn, relearn”
McKinsey report highlights
A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in
statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big
data… Furthermore, this type of talent is difficult to produce, taking years of training in the case of someone with intrinsic
mathematical abilities. (p.10)
Applying varying degrees of statistics, data visualizations, computer programming, data mining,
machine learning, and database engineering to solve complex data problems.
Association rule learning
Data Fusion and Integration
Supervised and unsupervised learning
Natural Language Processing
Time Series Analysis
Typical Big Data Job is not a BI Job
JOB OPENING: BIG DATA ARCHITECT
We are looking to expand our core product team with a Senior Java Developer/Architect that will contribute in the
product design and development and take pride in the delivery of kick-a** products.
Knowledge, Skills and Experience
Minimum 4 years Java experience
Experience with NoSQL Databases, preferably MongoDB (MapReduce, Sharding)
Experience with Cloud-based infrastructure, esp. AWS
Expertise with Hadoop eco-system is a plus (examples: Flume, Zookeeper, Ganglia, etc)
Experience with Web services (REST/SOAP)
Obsession with performance and big data
Passion for elegant technical design and good programming practices (TDD, CI)
Energetic "self-starter" with the will to take ownership and be accountable for deliverables
A true defender of quality and (light-weight) documentation of the designs
Relevant HBO/University education or experience
Sense of humor is essential
Not typical BI
○ Just sell your personal data
○ Wait until the big DM companies incorporate the Hadoop ecosystem
○ Learn Java and the Hadoop ecosystem
○ Learn Python/R
○ Learn statistics and all kinds of algorithms (especially Bayes)
○ Learn the principles of Hadoop/NoSQL
○ Learn how to integrate (big) data in the enterprise dwh
○ data governance/ data stewardship/ DQ / metadata
BI(g) Tool Specialist
○ Adopt a big data dataviz or reporting tool (Splunk, Platfora)
○ Adopt a platform (Cloudera, Hortonworks, MapR, Azure, Google, Amazon)
○ Data visualization tools, design info graphics
Data story teller
○ data journalism course
○ Explore platforms
○ Explore tools
Open data for personal and group branding
○ Start a project
○ Join open data sites
○ Start a blog/join a blog
○ Make news with data
○ Scanning business cases
○ Almere Datacapital
Group Activities BI United