be3
Big Data in a Box!
Part I: Comprehending the Landscape
By
Kalyana Chakravarthy Kadiyala
Contact Info:
Tweet – #ganaakruti
Email – kckadiyala@gmail.com
LinkedIn – http://www.linkedin.com/in/kadiyalakc/
Table of Contents
1. Disclaimer
2. Foreword
3. Datanomics – The Science behind Big Data
4. Big Data – Technical Enumeration
   4.1 How Big is Big?
   4.2 Handling the Big
   4.3 Lego talk – Key Building blocks
   4.4 Controls to Throttle
5. Big Data – Data-driven Infotainment
   5.1 Infotainment – Defining Moment
   5.2 Enterprises – What do they do?
   5.3 Role of an Analyst
   5.4 Netizens – Ideal State
6. Next Steps – Breaking it down!
   6.1 Guide wire
   6.2 Outline
7. Conclusion
8. Keywords
9. Bibliography
Illustration Index
Illustration 1: Connected World – Digitized Human Interactions
Illustration 2: Chaos Condition – Interaction Derailment
Illustration 3: Big Data Challenges – a visual perspective!
Illustration 4: Focus Areas – Big Data Solution Design
Illustration 5: Moving Parts – Layered Approach
Illustration 6: Key Throttles to control!
Illustration 7: Data-Driven Infotainment
Illustration 8: Enterprise Infotainment
Illustration 9: Connected World Experience – “All things Me”
1. Disclaimer
Below are a few pointers to help set your expectations about this article. The idea is to let you form a
baseline context and explore the information according to your consumption appetite.
• What is it about? It is an article about Big Data, covering aspects since its rise to prominence.
• Why now? There is a lot of buzz, jargon, and many variants of the same tool kits and solutions alike.
• How is it approached? We will identify, assert and map key domain coordinates as the first step.
Subsequently, we will try to vet them using several hypotheses to understand data and platforms.
• Who is the target audience? An intrigued mind, with an urge to explore the Big Data landscape,
where the desired approach is to reconcile, learn and experiment. The objective of such
experiments will be filling in a given data puzzle using relevant tools and techniques.
• Gotchas! While a few parts are deep technical narrations, most focus on functional aspects. You can
skip the technical sections if you don't want to read them.
• Handling? The elucidation approach uses both visual and verbal cues, primarily emphasizing the
aspect of 'why' and how each stakeholder segment would view it.
• Objective – It is multi-faceted, as described below:
◦ Gain a comprehensive understanding about the Domain
◦ Gather essential coordinates to navigate the Landscape
◦ Set stone to build a Big Data Platform to run controlled big data science experiments
◦ Solutions are built with an attempt to draw unit metrics
◦ Once built, we want to take our experiments to environments with larger capacities.
• Why this approach? It is a validation approach as we consume different topics, tool kits and
implementation concepts. This requires an approach that is agnostic to any connected environment or
platform. Hence this approach, where we can travel beyond mere hello-world intros.
• Is this a single shot? No, the idea is to make this a sequel attempt, such that we can refine and enrich
our knowledge iteratively, taking one aspect at a time and covering it to its full extent. Again, the
idea is not to be a jack of all trades! We want to get comfortable solving specific use cases where
data remains the quintessential pivot point.
• Note on Consumption
◦ Absolute No's – plagiarism is prohibited by all means and conditions.
◦ Questions – approach the Author.
◦ Criticism – welcome by all means. It helps us fail early rather than fail long and late!
◦ Feedback – log through channels such as tweets, email, and wall posts as indicated on first page.
◦ Terms & Conditions – The author has adopted the GPL license. The terms and conditions therein apply
here too!
• Last, but not least
◦ Thoughts and concepts presented and touched on in different parts of this articulation bear no
resemblance to, and carry no responsibility toward, any of the professional associations the author
is currently engaged in.
◦ We draw parallels to commonly seen and experienced use cases, in parts or as a whole.
◦ We apply our learnings – technical and functional – in an attempt to reconcile the understanding.
2. Foreword
The wave of Big Data is still riding the markets at its high peaks, sweeping across various industries and
geographies. Deep within the tech dungeons, the effects of euphoria still linger over many minds. Select
groups had a first-hand opportunity to experience the phenomenon; examples include those who had to deal
with web-scale or machine-scale data. The digital footprints we see today are the net effects of
technological advances during the past decade, only exploding what existed in mainstream enterprises before
then. The important aspect here is not being a mere witness to the evolution, but being positioned at the
epicenter and experiencing the events in essence rather than by mere taste. All this occurred long before the
lexicon went viral to gain its current prominence. Others gazed, and still gaze, at it with amusement.
The current trend had its genesis at least a decade ago, during the high days of Telecommunications and
the Internet revolution in general. Key contributors to the innovation that we see and experience today
include players from verticals such as Digital Marketing, Information Search, Telecommunications and Social
Media (a relative newcomer). Later supporters include those in Hi-tech and, lately, folks from Genomics
and other branches of Biology. We now have several platforms to support management of these
exponentially large digital footprints, with many more to come. There are tool kits that enable us to tap into
the insights the footprints have to reveal. The quality and success of such toolkits and platforms are measured
by how effectively they handle the noise levels in the data streams. Key here is the quintessential leverage or
vantage point within the context, sensitive to the time frame in which its value is relevant.
The age of prominence is probably about five years now. The reason we get drawn to it is that we seek
betterment – betterment in our overall socio-economic condition as well as in how we engage in interactions.
The dynamics have changed by far. It all starts with an inquisitive impulse, gradually settling in as an essential
need rather than a desire. This resulted in the emergence of a new breed of humans – Netizens – at least within
urban realms. Many more are being touched as the undercurrents spread to new territories. Enterprises
are driving the adoption, primarily for their own reasons such as heeding the competition, sustaining
bottom lines and accelerating top-line performance.
With adoption rates on the rise, the information highways are seeking expansion in their transport and
delivery capabilities, as well as in their efficiency to contain and process information as it moves between
producers and consumers. Phrases such as text me, tweet me, post it to my wall, customer-360, sentiments,
emotions and so forth are now considered part of common expression. Data nuggets, once set in motion, get
transformed, trans-morphed and even consumed in varying contexts. Related technical terms include Noise,
Signals, Data Science, Data Mining, Data Modeling, Data Streams, etcetera!
This article is an attempt to comprehend the present-day highlights of the Big Data landscape. As we
map essential coordinates, we also set stone on next steps that focus on gaining some practical exposure.
We will begin by describing the factors that drove the phenomenon so far, then try to paraphrase the subject
in a more logical sense, using both technical and functional aspects. Finally, we provide a perspective on how
each stakeholder category is affected by the undercurrents, especially from the aspect of what it means
to the bottom line, aka the socio-economic condition of each participating stakeholder. As part of this effort we
use terms such as Big Data, Voids – Data & Digital, Datanomics, Socio-Economics, limitations of human
cognitive capabilities, etc. Details are presented using verbal descriptions, supported by visual cues,
appropriate enough to give a high-level context.
All theory and no practice is not a good way of learning either, and it is fair to ask for practical
exposure. In subsequent articles we will delve into specific aspects, go beyond abstractions, pick some
specific use cases and try them using a Big Data Box. We will build the platform in due course. The canonical
title given to this complete attempt is – “be3: Experimenting Big Data in a Box!”.
3. Datanomics – The Science behind Big Data
Datanomics – gibberish as it may sound, it is the science behind all things Big Data. It is a lexicon
derived from two other famous words – Data and Economics. Data, as most of us know, is a nugget of
information that can describe an entity or an event either in part or as a whole. Economics is a social science
that helps us visualize how well our economies function. It emphasizes studying the patterns
associated with production, distribution and consumption of various goods and services with some
positional value.
As social dwellers, we are among the entities that participate, interact and contribute to the overall
economic functions and outcomes. A few are personal, while others are more generic and apply to the common core.
With digitization, the trails of our participation are now held by machines (tools and avatars). The commonly
referred buzz word here is Hi-tech. Digital footprints, as we know, are mere electronic traces locked inside
several silos (specific and contextual, single versions of truth as well as snapshots). These traces, when
analyzed together, can reveal patterns and insights that can prove very important. One can invent and
innovate potential opportunities to better our overall socio-economic function, and also foresee
quintessential future course corrections to avert adverse casualties.
Illustration 1: Connected World – Digitized Human Interactions
The preceding visual describes the connected and integrated state of our digitized lives, with reasons to
quote. Interaction experiences that we humans gain are very sensitive to the situations we encounter,
participate in, interact with and respond to on an ongoing basis. A few will have short-term effects, while
others last long. Such effects can be positive, negative or neutral in their application and relevance.
The takeaways are very much contextual. For example, an enterprise organization likes to gain better
conversion rates to increase its current market share, retain customers, sustain bottom lines, achieve
top-line performance, and so forth. All of these require access to insightful information, when needed and within
the shortest optimal response time. Factors such as confidence and accuracy hinge on one
critical dimension – time, or timing. Analytics is the key here.
However, the changing dynamics of our socio-economic conditions are demanding us to adapt and
innovate at a much faster rate. There are two primary reasons for this requirement. First, pertinent
facts are now generated at machine scale – much faster than a normal human cognitive brain can
register and process once a fact is surfaced. Second, due to inherent latencies and deviation in time
relevance, many of those facts end up as noise in the stream.
The success of Analytics is hinged on the contextual relevance of the insights it produces, given space and time
dimensions as key variables. Converting information into insights can be challenging, even when the context
is kept constant. Also, when insights are served, there are varying levels of abstraction that each consumer
can handle and tolerate. The important catalyst here is human cognition. A few require detailed insights, while
others are okay with the gist itself. The mode of representation and communication affects the ability to grasp.
Information can be represented and exchanged verbally, non-verbally, visually, or as a combination of
these.
That said, the field of Information Technology is going through a major shift in its course of evolution.
There are now many tools and choices. This is true for both enterprises and individuals. With changes in the
underlying digital avenues, the size of digital footprints is also growing, and in exponential proportions.
Fragmentation can lead to higher noise levels – a common problem for both production and
consumption. Simplifying this process to provide fact-based contextual insights is essential.
Analytics, without its complexity gig, starts with a few basic assertions – just enough for us to get a grasp
of the current state of affairs. Deeper needs dwell further by making hypotheses, validation of which
requires factual data deep and wide enough to satisfy the ask. It is critical not to lose the knowns already
in position while the efforts focus on drawing meaning out of the unknowns. At the least, this helps us stay away from
entering a chaotic condition.
Illustration 2: Chaos Condition – Interaction Derailment
Whatever your approach may be, it is important that you don't lose the state of data nuggets by their key
features and characteristics. A few critical ones are the dimensions of time, space and those that provide
their relevance to the environment in which they are touched. A few aspects to keep track of are listed below,
followed by a minimal sketch of such a record:
• Why – this should be the first question one needs to ask when scoping data nuggets into your activity.
It will help establish a clear reason for the role a nugget plays in the overall equation, rather than it being
something vague and ambiguous.
• What – this is the positional or contextual question that specifically declares in what sense you want
to use a certain data point. We could answer this by relating it to terms such as expressions,
impressions, presumptions, imaginations, emotions, sentiments, etcetera.
• How – once you bring a data element into the mix, this question will help clarify how you are going
to apply it by its relevance to the context – for instance, whether you would use a certain nugget as a catalyst,
to complete a whole meaning, or as a substrate.
• Where – this question is more about timing, or the time dimension in which the data element is applied or
produced. This will have a direct impact on how the consuming entity perceives the value. Entities
in this context are Humans, Machines and/or Digital Avatars representing humans.
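To make that scoping concrete, here is a minimal sketch of how the why/what/how/where of a data nugget, along with its time and lineage coordinates, could be captured as a lightweight record. The field names and the example values are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataNugget:
    """Illustrative scoping record for a data nugget; field names are hypothetical."""
    payload: dict                 # the raw fact or observation itself
    why: str                      # reason this nugget is in scope for the activity
    what: str                     # sense in which it is used (expression, sentiment, ...)
    how: str                      # role it plays (catalyst, substrate, completes a meaning)
    where: str                    # producing/consuming entity (human, machine, avatar)
    captured_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    source: str = "unknown"       # territorial/lineage hint, useful for governance checks

# Example: a tweet-like nugget scoped for a sentiment experiment
nugget = DataNugget(
    payload={"text": "loving the new phone!"},
    why="feeds the sentiment hypothesis",
    what="expression/sentiment",
    how="catalyst for an aggregate insight",
    where="human via social-media avatar",
    source="public stream",
)
print(nugget.why, "|", nugget.captured_at.isoformat())
```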
Once you have slated the above aspects clearly, it is also important to check on the following operational items:
• Everything is bound by timing, without losing the significance of context.
• Avoid casual approaches, as you'd have to resolve many casualty clauses through the process of
exploration and insights formulation.
• Honor territorial conditions, as data must be governed for effective utilization.
• Enforce rules and policies, so to speak, on either production or consumption. You don't want to end up in a
perennial cycle of babysitting someone else's desire.
• Mitigate – bottom line, we don't want to get engulfed in a situation where our biological reflexes
and cognitive capabilities are being degraded. This aspect is more about how you define the
methods to deal with unknowns.
4. Big Data – Technical Enumeration
4.1 How Big is Big?
Big is a relative annotation. It depends on who is consuming and in what context; something that is big
for one entity may seem trivial to another. Let's check a few coordinates, so that we can quantify the
element of Big without losing its contextual sense and associated quality aspects. In technical terms, the
list of coordinates includes Volume, Variety, Velocity and Veracity. A few choose Value in place of Veracity;
since Value is more contextual, we will leave it at its best abstract meaning and purpose. With all the data
flowing through various channels, communications and exchange hubs, orchestration can become
complicated.
Illustration 3: Big Data Challenges – a visual perspective!
The complexity of the situation can be expressed using a geometric plane, where each aspect of the
challenge is represented by a certain axis, as shown in the above visual. Each axis on the
plane represents a certain constraint on the data: Volume, Variety, Velocity and Veracity. These
constraints drive the overall positional value of insights.
• Volume – sheer size of the data sets as manifested from their source
• Variety – discrete forms in which data or facts can exist, either in parts or as a whole
• Velocity – the rate at which data gets generated
• Veracity – this usually refers to data lineage, helpful in asserting the truth factor associated with
facts
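As a rough, hedged illustration of how these coordinates might be quantified for a small batch of records, consider the sketch below. The field names and the veracity stand-in (whether a record declares a source) are arbitrary assumptions made only for this example, not a standard definition:

```python
import json
from statistics import mean

def profile_batch(records, window_seconds):
    """Illustrative 4V profile of a batch; every metric is a simplistic stand-in."""
    volume_bytes = sum(len(json.dumps(r)) for r in records)            # Volume: raw size
    variety = len({frozenset(r.keys()) for r in records})              # Variety: distinct record shapes
    velocity = len(records) / window_seconds if window_seconds else 0  # Velocity: records per second
    # Veracity stand-in: fraction of records that declare a lineage/source field
    veracity = mean(1.0 if r.get("source") else 0.0 for r in records) if records else 0.0
    return {"volume_bytes": volume_bytes, "variety": variety,
            "velocity_per_s": velocity, "veracity": veracity}

batch = [
    {"source": "sensor-7", "temp_c": 21.4},
    {"source": "sensor-7", "temp_c": 21.9},
    {"text": "post without declared lineage"},
]
print(profile_batch(batch, window_seconds=10))
```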
Big Data play has three main components – Data, Infrastructure, and Data Science. Data as we know is a
collection of nuggets that describe facts and entities. The necessary tool-kits to support data aggregation,
storage, and accessibility form the underlying infrastructure. Data Science is the black magic that turns
discrete nuggets into a composite information insight.
For the more intrigued mind, Big Data is a quest for singularity. Data nuggets are in a constant spin
where they get sliced and diced until essential insights are produced. It includes analysis of deterministic
behaviors and patterns along with non-deterministic casualties. It is an amalgamation of inferential
statistics and applied mathematics. Domain expertise and experience are key; they help maintain better
control over noise levels and deliver clearer insights.
Knowing the origins of data sources and asking the right set of questions is
important. Also, assume by default that data can come from heterogeneous sources and that it can come in
fragments. The sources can be either internal or external to your operating realm. Following are a few
examples:
• Machine Generated data such as event logs, sensor data, usage metrics, etc.
• Socio-economic digital footprints – social media posts, feedback, and other disjoint sets, etc.
• Residual data from our past consumption. For example, emails, text messages, etc.
• Disintegrated and fractured data – often caused by territorial or boundary conflicts.
As you begin to process these different nuggets through continuous exploration and constant mining,
the outcome must be presented. The presentation is both visual and interactive;
interactivity can be the need to toggle the available visual cues or to set a different query (search or
filter) criterion. Also, be aware of engineering fallacies. They manifest in the form of either under-provisioning
or over-provisioning a certain need, and the associated costs usually run in exponential terms of
time and/or money. Hence, failing early is okay, rather than failing late.
4.2 Handling the Big
Is Big Data a technological phenomenon, or a functional aspect identified by use case analysis? Is Big
Data transactional or analytical in nature? Is it handled in batches or in real time? If it is real time, how real
time are we talking, and when do you say it is real time? Do we get to work with snapshot data or data that is
constantly in motion, aka streams? What will be the vastness of the time dimension in the context? Is it
forecasting or prediction? What will it take to process such vast amounts of data? How significant is the
impact on the consuming layer? How different is data security when compared to conventional scenarios?
These are a few questions, among many, that are being asked by legions of technologists.
Shifting focus, let us get a sense of the available inventory – inventory that helps you build and support
Big Data requirements. Key components include computational power, I/O, storage and presentation.
Multi-core processors are now the norm. Besides the CPU, a few are hedging on the power that a GPU (Graphics
Processing Unit) can deliver. Typical laptops come packed with quad-core processors, with starter clock speeds
beginning around 2.7 GHz; 8 GB of RAM with 1 TB of storage is quite common as
well. All of these fall within an economically feasible range, and these components carry more
intelligence than their predecessors. Software frameworks and other associated tool kits have also gone through
a major overhaul. Programming languages now support constructs that reveal hooks to unlock the true potential
of multi-core systems.
As a segue topic, it is now possible to run a decently balanced cluster of compute and storage nodes
on a single hardware unit – for example, a laptop – at least enough to run controlled
data science experiments as lab work. These can then be safely ported to larger operational environments
with little configuration change. This avoids the cost of overlooking critical aspects that usually hide
at higher levels of abstraction. In subsequent articles we will touch on aspects of parallel runtimes
and linear scalability. Tool selection is critical to ensure an optimal balance between the key runtime expectations of
a system: consistency, availability and partition tolerance – of course, with the ability to deliver the best
performance. While a few tools are on the bleeding edge and evolving further, most are on the cutting
edge.
One aspect is very important if you indeed plan to run skunk works on mobility hardware –
energy efficiency: efficiency in powering your system as well as heat management, for as long as you intend to
run experiments. There are breakers to avert fire hazards, but even before they trigger, your system
should be able to freeze its compute state.
What else do we need to know about Big Data? This is simple: don't get baffled. Embrace the change
and be ready to fail fast instead of failing long.
4.3 Lego talk – Key Building blocks
Illustration 4: Focus Areas – Big Data Solution Design
A platform to handle big data sets in a cohesive manner must be built by factoring in both the functional
and technical aspects of data. This is essential at each stage – from initial analysis through to the implementation
and operational phases. Following are a few essential terms that should be understood well in this context:
• Real-time – this metric is based on the time and space dimensions of the context that you are trying to
meet objectively. Downstream validation can be vetted either in the form of a presentation or an
event that drives further actions.
• Data Pipelines – these are conduits that acquire data from discrete sources into the data
platform. They also support dissemination flows to meet data-as-a-service requirements (a minimal
pipeline sketch follows this list).
• Insights Extraction – this is a constant cycle of mining and modeling data for insights. The goal here is
to enhance the richness of the insights the data can produce.
• Integrated Analytics – productive utilization of big data insights can be achieved if the insights
exhibit high levels of accuracy and confidence in how they represent the truth factor. This is where
analytics as an interactive function is essential, achieved through visual and verbal cues.
• Semantic Inferences – the cognitive capacity of humans and their attention span is very limited. To
deliver a consistent experience from insights, semantic inferences cannot be ignored when you package
insights for consumption.
• Quality, as we learn, is a relative measure driven by consumer requirements. From a producer's
perspective the quality metric is defined by descriptors such as durability, responsiveness, timing, etc.
Often it is important to project these metrics using quantifiable absolute values.
• Post-Relational – this has become a reference point for viewing the Big Data evolution from a data
management standpoint, applied in a fairly lightweight sense. What it means is that the
data management methods followed so far do not scale as-is to handle current volumes; scalability
and performance are even more challenged and complex. Data and compute are the two facets of this
coin.
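To ground the pipeline and insights-extraction blocks above, here is a minimal, hypothetical sketch of an acquire, cleanse and extract flow built from plain Python generators. A real platform would swap each stage for a distributed equivalent; the stages and the toy word-count "insight" here exist only to show the shape of the flow:

```python
def acquire(raw_lines):
    """Acquisition stage: pull records from a (simulated) source channel."""
    for line in raw_lines:
        yield line.strip()

def cleanse(records):
    """Cleansing stage: drop empty records and normalize case."""
    for rec in records:
        if rec:
            yield rec.lower()

def extract_insight(records):
    """Insights-extraction stage: a toy aggregate (word frequencies)."""
    counts = {}
    for rec in records:
        for word in rec.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# A simulated source; in practice this would be a stream, queue or file channel.
source = ["Sensor offline", "sensor ONLINE", "", "sensor online"]
print(extract_insight(cleanse(acquire(source))))
```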
4.4 Controls to Throttle
As touched on in previous sections, several discrete parts need to be brought together to form a cohesive
whole. It is also recommended to presume that these parts can be set in motion independently of each other,
and that they can fail without any precursors. Once glued, scaling them becomes a fine art that
requires a detailed eye. The architectural constraint applied in the industry is the CAP (Consistency, Availability
and Partition Tolerance) theorem. A step down, it is important to focus on concurrency, throughput and
fail-over aspects. One thing to clarify about the visuals below: terms such as Java, Hadoop, Unix, etc., are used
only to set some contextual reference. It could be any other tool with similar or even better capability.
Teeing these topics back into the discussion of aspects such as Data Streams and Data Science, we see a
Big Data Engine at full throttle. Efficiency is a measure of how well the discrete pieces align with
their peers, such that you see a fluid, pass-through straight line of Digital Highways.
Illustration 5: Moving Parts – Layered Approach (Hardware Resources: Physical/Emulated; Infrastructure: Hadoop, JVM, Unix, etc.; Application: Heap, Data Models, Compute, etc.)
Illustration 6: Key Throttles to control! (Concurrency, Throughput, Failover)
Skipping the gory details of implementation, let's assume you now have a system that is completely
provisioned from an infrastructure standpoint and is operational. In such circumstances, how do you monitor
and manage the system, and ensure five-nines productivity and efficiency? What does it even mean to ask for five
nines? We are asking for consistent performance, keeping computational efficiency as a constant time factor;
other influential variables include data moving in the space and time dimensions.
Your try-catch blocks (exception or error handling) can only cover deterministic behaviors that
account for known issues. Taming systems when they exhibit nondeterminism is a daunting task. The preceding
illustration presents a set of least common denominators, aka controls, that you can throttle to monitor and
manage your operational environments. The goal here is to balance the operational constraints – the Consistency,
Availability and Partition Tolerance (CAP) aspects.
• Consistency – guarantee consistent usability, even if underlying parameters deviate and vary.
• Availability – system services are always available on demand, within a constant SLA factor.
• Partition Tolerance – data and compute capacities are spread and distributed across the various nodes in the
cluster that form the platform. Partition tolerance is about dealing with inherent fragmentation or
dropped communications, and yet being able to provide a complete experience to the entities that interact
with the system. This primarily affects performance, and then results in a dwindling state of
consistency and availability.
5. Big Data – Data-driven Infotainment
5.1 Infotainment – Defining Moment
Paraphrasing the discussions so far, we need the tool to deliver insightful, contextually relevant
information – primarily with the ability to access it on demand, derived from quantifiable metrics and
qualitative inferences. Anything else will be garbaged out easily, without even a single glance. Once an
effective baseline is established, there will be at least three broad categories of audience drawn
toward it – Enterprises, Analysts and Consumers (aka Users). Their requirements will be very
discrete. Zoom in close, and no wonder you can also sense the randomized patterns that each of their
requirements and levels of abstraction would carry: everyone from a naive brain to a highly analytical and
matured one will be glancing at the same piece of information.
Illustration 7: Data-Driven Infotainment
The preceding diagram sets a visual context of data flows as a perennial stream. The participating
entities produce and exchange information in more than one form – verbal, non-verbal, audio and video.
During its course the data gets transformed, and even trans-morphed, to suit different needs and contexts.
An ideal infotainment situation is one where unwanted information gets discarded securely and with some
sense of responsibility.
5.2 Enterprises – What do they do?
Let's switch context to more specific examples. Moving in the order of enumeration, Enterprises
(for-profit or non-profit) consider the infotainment medium a power boost. It is an essential tool in their
tool chest that can unlock potential capabilities they can further explore, if not totally capitalize on. There
are now business models that are purely data-driven: data is their new currency of trade and commerce, used
either to improve top-line performance or to sustain the bottom line. The field of Business Intelligence (where
information is constantly augmented to produce actionable insights) is now at least two decades old in its
path towards maturity. From a mere passive, referential reporting or summed-up dashboard experience, it has
now moved to a very dynamic field of study – study that can reflect the current state of affairs with more agility,
accuracy and confidence.
As the changes in the underlying fabric become more apparent, the rules of the trade (standards
and governance rules) and the scope of their enforcement are also changing. The number of stakeholders and data
stewards is now vastly larger and more diverse. As humans, we are still fond of instantaneous results. We try to
capture the perspective of an Enterprise (for-profit or non-profit) in the following visual representation:
Illustration 8: Enterprise Infotainment
5.3 Role of an Analyst
Let's move to the next category of users – the Analysts. On the stage of infotainment, this group's role
is primarily to dissect the state of affairs. Rational thinking and negation are key characteristics of this
group. Their prime focus is on asking fundamental questions such as Why, What, Where, When and How.
Their modus operandi can be either independent or biased, largely influenced by the level of association and
degree of affiliation to any particular enterprise or organization. Their core strength lies in their ability to
reflect the current state of affairs as a single truth statement. Their job is to provide a clear picture of the
undercurrents within the socio-economic fabric.
This group's information requirements come at a variety of abstraction levels. Use of information is
subject to the standards, compliance and rules enforced by their controlling authority. Also, once they travel
beyond their confinements, the degree of enforcement gets even stricter. Unauthorized application or
misrepresentation of facts will only backfire – instantaneously, decisively and intensely. They need to be
supported with these requirements as well, besides accessibility and availability of the required data. Once again,
time and space will be your challenges.
5.4 Netizens – Ideal State
Moving down the path is our penultimate and critical actor on the stage of Infotainment – the Netizen!
As powerful as they may sound, they can be that vulnerable too. They are the most exposed entity in the whole
ecosystem. Compliance and social responsibility are usually at the very end of their priority list. They assume
several aspects – aspects that range from availability to security. If you pause a while and observe very
carefully, they are like inventory to a service or a product. Their participation is driven merely by getting
intrigued, by socio-economic interests, or by an act of goodwill and assumed trust.
This group consumes and produces information in frames. Think of a moment in a reel of film where
you have subjects and some context – something that is subject to change with the next scene in the
roll. Typical challenges faced by this group include fragmentation of information and communications.
Information will lose its relevance or value if it is not provided to them with the right timing. It's the same in
both scenarios – when it is being asked for, and when it can be provided because of their subscriptions and
preferences.
Instead of tuning into multiple channels and platforms, can this group be provided a single facade or
portal to get a snapshot of their overall socio-economic condition? Will that help? The following illustration
provides a visual perspective of this need:
Illustration 9: Connected World Experience – “All things Me” (point-in-time interactions across dimensions such as society, relations, culture, career, finance and health)
You may ask, what are my benefits? It will save you time in learning about socio-economics, or at least how
you are faring on that scene. In economic terms, it will optimize your use of standard communications platforms,
and maybe save a few extra pennies that are being thrown away as access costs for Internet on the go! We
are seeing API-based approaches in many ways; they may be worth exploring.
6. Next Steps – Breaking it down!
6.1 Guide wire
Subsequent efforts will focus on setting up an experimental big data platform. The platform will be
built using a decently powered laptop computer. Baseline specifications for the hardware are: a processor
comparable to an Intel i5/i7 quad core, 12 GB minimum of DDR3L 1600 MHz SDRAM, and a 7200 RPM HDD with
128 GB to be spared for the cluster storage. A few laptops come with lower HDD speeds of
5400 RPM. For example, the cluster being built will use an ASUS ROG 750JW gaming laptop that comes packed
with all of the above, except that its HDD yields 5400 RPM. If you'd like, you can add another drive in the spare
socket that clocks higher. Here is the link for more details on the ASUS one! A minimal sketch for checking a
machine against this baseline follows.
Further, we will leverage the platform to assert the following hypothesis, on the basis of different use cases
and data streams. It is an attempt to represent the data challenges in mathematical terms:
insights: f(data) = ∫ log(data) d(context);  context = {0, …, ∞};  data = {1, …, ∞}
The interpretation of the above mathematical statement is as follows:
• Variables
◦ Data – an information nugget that describes either a part or the whole of an object, event or action.
◦ Context – representative of a situation or an environment setting that gives a more time-based,
relative sense of an object, event or action, and how it is perceived by participating entities.
Context is not very different from views on top of database tables, in some sense – except that
here the views are non-materialized and dynamically generated for that moment of consumption.
• Value Range
◦ The number of data points must be at least one, to comprehend its meaning or purpose and be able
to apply it effectively. Since the outbound value is not determined and is usually driven by the
context, it is represented here as extending to infinity (∞).
◦ Driven by producing or consuming entities, there can be zero or more contexts. For example, a
process responsible only for acquiring data and transporting it to some target system may not
need to check the inherent contexts the data set is capable of projecting. On the other hand, if a
consuming entity is trying to access a certain data set, it may be interested in querying by
context, such that it helps its information requirements. Above all, a producing entity can also
emit data by context, such that downstream consumers can use the output effectively with
little or no processing overhead. Hence the range of context is expressed between zero and infinity
(∞).
◦ In both cases, you may ask why the starting value in the range is zero. Well, there cannot be a
negative indicator, aka NaN, if you want to make sense out of either the context or a data point.
• Key Catalyst
◦ Any insight is considered relevant only if it can be aligned to the nature of the producers and
consumers of the underlying data set/information nuggets. Hence, as part of asserting the above
hypothesis, we will add a key catalyst to the variable mix – Actors (the participating entities,
producers and consumers).
◦ If you recall section 5 above, the very first generalized visual sets the context of insights being
consumed in frames – frames that are relevant to consumers. The role of these Actors will be
vetted by the very requirements that each frame in the process expects.
◦ The effect of this catalyst is usually exponential. The reason we say 'usually' is the opportunity
to address most parts using commonality and granularity adjustments. A minimal numerical sketch
of the hypothesis follows.
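To make the hypothesis tangible, here is one literal, purely illustrative reading of f(data) = ∫ log(data) d(context): insights accumulate as a crude sum (standing in for the integral) of log-scaled data measures over the contexts in play, with the Actor catalyst modeled as a simple multiplier. None of this is a real metric; it only shows the diminishing-returns shape the log term implies and the zero-context case described above.

```python
import math

def insight_score(data_points, contexts, actor_factor=1.0):
    """Crude, illustrative reading of f(data) = ∫ log(data) d(context).

    data_points  : positive measures of the data in play (data >= 1)
    contexts     : weights of the contexts considered (context >= 0)
    actor_factor : hypothetical catalyst for producers/consumers (Actors)
    """
    if not contexts:                      # zero contexts: transport-only, no insight asked for
        return 0.0
    body = sum(math.log(d) for d in data_points if d >= 1)
    return actor_factor * sum(c * body for c in contexts)

# Doubling the data does not double the score: the log term flattens the gain.
print(insight_score([10, 100], contexts=[1.0]))        # ~6.9
print(insight_score([20, 200], contexts=[1.0]))        # ~8.3
print(insight_score([10, 100], contexts=[1.0, 0.5]))   # more contexts, more insight
```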
6.2 Outline
Following are the specifics of how we will pursue this domain and its subject constituents further:
• Assemble a mobile Big Data platform on a decently powered Laptop – a cluster of 5 nodes at most!
• Data Aggregation:
◦ Build data ingestion pipelines for data coming in different forms and representation formats
▪ Channels to pour data into the cluster
▪ Computation modules to slice/dice such data into immediately consumable at-rest models
◦ Simulate scenarios to fail the critical modules.
◦ Trace the steps to achieve full system and data recovery
• Analytics
◦ Data Mining with an objective to derive all probable contexts from a given data set
◦ Take such data set for one or more associated contexts and generate insights
◦ Project such insights to meet a User or a Machine requirement objectively
• Presentation (visual and application)
◦ Simulate scenarios to consume the projected insights
◦ Identify and gather learnings from such presentations
◦ Re-cycle them back into the core for further refinement and source data enrichment
• Through and through,
◦ Understand relevancy of tools, given a problem context
◦ Draw parallels between one approach to another – technical and functional
7. Conclusion
Big Data, which started as a marketing buzz, has now settled into more practical channels of application. In a
retrospective approach, we have tried to get our arms around the concept of Big Data. This involved attempts
to learn about the generational shifts, manifestation sources, general applicability and practical relevance to
the different categories of entities that exist in our socio-economic landscape. We also touched on a few technical
aspects such as platform architecture, degree of complexity, challenges, etc. As mentioned at the very
beginning of this article, we want to experiment with and experience the phenomenon. We will cover the practical
aspects in a sequel article – be3: Controlled Big Data Experimentation.
8. Keywords
We live in the age of Search. We depend on search tools to ask questions. We are okay with even the
slightest clue about what we are looking for. This section provides a vocabulary of words and phrases that
will help you gain the context quickly and easily.
• Big Data • Contextual Relevance • Human Cognition
• Relative Sense • Time or Timing • Confidence
• Accuracy • Hadoop • High Performance Computing
• Data and Economics • Infotainment • Datanomics
• Data Science • Mathematics • Relational vs Post-relational
• Meta-data • Information Insights • Data Nuggets
• Netizen • Controlled Data Experiments • Patterns
• Noise • Signals • Data Abstraction
• Governance • CAP • Standards Compliance
• Semantic Inferences • Amplification • Insights with clarity
9. Bibliography
This section serves as a bibliography, listing links to the various Internet sources that were tapped to
acquire background knowledge on the topics of Big Data, Hadoop, Linux and High Performance
Computing. Most of the known articles are referenced directly; many untracked forum posts on
sites such as Stack Overflow, OSDIR, Google Forums, etc., are not listed individually.
While some portions of this article provide links to external references, following is the list of all known
and tracked resources available from the Internet. These were used to further the understanding and refine
the grasp.
• Contextual Computing: Our Sixth, Seventh and Eighth Senses
◦ http://www.forbes.com/sites/reuvencohen/2013/10/18/contextual-computing-our-sixth-seventh-and-eighth-senses/
• Context
◦ http://en.wikipedia.org/wiki/Context
• Economics
◦ http://en.wikipedia.org/wiki/Economics
• Optimality Theory
◦ http://en.wikipedia.org/wiki/Optimality_Theory
• Open Sans Font – Apache License
◦ http://cooltext.com/Download-Font-Open+Sans
• Oracle VirtualBox Documentation
◦ https://www.virtualbox.org/manual/UserManual.html
• Consistency Types
◦ http://en.wikipedia.org/wiki/Consistency_model#Types
• Big Data Interest is Soaring, but Adoption Rates are Stalling
◦ http://www.hightech-highway.com/communicate/big-data-interest-is-soaring-but-adoption-rates-are-stalling/
• Is iOS7 A Better Innovation Platform than Android?
◦ http://www.forbes.com/sites/haydnshaughnessy/2013/06/19/is-ios-7-a-better-innovation-platform-than-android/
• Manifold
◦ http://en.wikipedia.org/wiki/Manifold
• Technological Singularity
◦ http://en.wikipedia.org/wiki/Technological_singularity
• Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
◦ http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
• w3.org: Semantic Web – Inference
◦ http://www.w3.org/standards/semanticweb/inference
insights: f(data) = ∫log(data)context;context = {0,…,∞};data={1,...,∞}

More Related Content

What's hot

Final communication and connectedness v3
Final communication and connectedness v3 Final communication and connectedness v3
Final communication and connectedness v3
Mia Horrigan
 
Newcastle Intro 2015
Newcastle Intro 2015Newcastle Intro 2015
Newcastle Intro 2015
Lee Schlenker
 
Few data visualization-extending_the_analytical_horizon
Few data visualization-extending_the_analytical_horizonFew data visualization-extending_the_analytical_horizon
Few data visualization-extending_the_analytical_horizon
Elsa von Licy
 

What's hot (9)

2009 06 few
2009 06 few2009 06 few
2009 06 few
 
Computational Thinking in the Workforce and Next Generation Science Standards...
Computational Thinking in the Workforce and Next Generation Science Standards...Computational Thinking in the Workforce and Next Generation Science Standards...
Computational Thinking in the Workforce and Next Generation Science Standards...
 
Final communication and connectedness v3
Final communication and connectedness v3 Final communication and connectedness v3
Final communication and connectedness v3
 
Newcastle Intro 2015
Newcastle Intro 2015Newcastle Intro 2015
Newcastle Intro 2015
 
Nysais presentation may 2010
Nysais presentation may 2010Nysais presentation may 2010
Nysais presentation may 2010
 
Briefings direct transcript how florida school district tames the wild west o...
Briefings direct transcript how florida school district tames the wild west o...Briefings direct transcript how florida school district tames the wild west o...
Briefings direct transcript how florida school district tames the wild west o...
 
Cognitive Internet of Things: Making Devices Intelligent
Cognitive Internet of Things: Making Devices IntelligentCognitive Internet of Things: Making Devices Intelligent
Cognitive Internet of Things: Making Devices Intelligent
 
Linking Enterprise 2.0 to Knowledge Exchange In Organizations
Linking Enterprise 2.0 to Knowledge Exchange In OrganizationsLinking Enterprise 2.0 to Knowledge Exchange In Organizations
Linking Enterprise 2.0 to Knowledge Exchange In Organizations
 
Few data visualization-extending_the_analytical_horizon
Few data visualization-extending_the_analytical_horizonFew data visualization-extending_the_analytical_horizon
Few data visualization-extending_the_analytical_horizon
 

Similar to Be3 experimentingbigdatainabox-part1:comprehendingthescenario

Wk11 the innovation development process
Wk11 the innovation development processWk11 the innovation development process
Wk11 the innovation development process
WaldenForest
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docx
juliennehar
 
The great collision of open source, cloud technologies, with agile, creative ...
The great collision of open source, cloud technologies, with agile, creative ...The great collision of open source, cloud technologies, with agile, creative ...
The great collision of open source, cloud technologies, with agile, creative ...
Reading Room
 
Mi0040 technology management
Mi0040  technology managementMi0040  technology management
Mi0040 technology management
smumbahelp
 

Similar to Be3 experimentingbigdatainabox-part1:comprehendingthescenario (20)

Economicsof socialcomputing richblankv2_2008
Economicsof socialcomputing richblankv2_2008Economicsof socialcomputing richblankv2_2008
Economicsof socialcomputing richblankv2_2008
 
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
Facilitating Collaborative Life Science Research in Commercial & Enterprise E...
 
The Cognitive Digital Twin
The Cognitive Digital TwinThe Cognitive Digital Twin
The Cognitive Digital Twin
 
How I accidentally built a tech startup — without any technological knowledge
How I accidentally built a tech startup — without any technological knowledgeHow I accidentally built a tech startup — without any technological knowledge
How I accidentally built a tech startup — without any technological knowledge
 
Presentation To Seda Technology Programme
Presentation To Seda Technology ProgrammePresentation To Seda Technology Programme
Presentation To Seda Technology Programme
 
Physical Terrain Modeling in a Digital Age
Physical Terrain Modeling in a Digital AgePhysical Terrain Modeling in a Digital Age
Physical Terrain Modeling in a Digital Age
 
Infrastructure as Destiny — How Purdue Builds a Support Fabric for Big Data E...
Infrastructure as Destiny — How Purdue Builds a Support Fabric for Big Data E...Infrastructure as Destiny — How Purdue Builds a Support Fabric for Big Data E...
Infrastructure as Destiny — How Purdue Builds a Support Fabric for Big Data E...
 
Horse Essay In Hindi Language
Horse Essay In Hindi LanguageHorse Essay In Hindi Language
Horse Essay In Hindi Language
 
Wk11 the innovation development process
Wk11 the innovation development processWk11 the innovation development process
Wk11 the innovation development process
 
Keynote Address: Digital Transformation & Cultural Heritage, A provocation in...
Keynote Address: Digital Transformation & Cultural Heritage, A provocation in...Keynote Address: Digital Transformation & Cultural Heritage, A provocation in...
Keynote Address: Digital Transformation & Cultural Heritage, A provocation in...
 
GCSECS-ImpactOfTechnology.pptx
GCSECS-ImpactOfTechnology.pptxGCSECS-ImpactOfTechnology.pptx
GCSECS-ImpactOfTechnology.pptx
 
Technology offering
Technology offeringTechnology offering
Technology offering
 
Requirements Engineering for the Humanities
Requirements Engineering for the HumanitiesRequirements Engineering for the Humanities
Requirements Engineering for the Humanities
 
Toward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docxToward a System Building Agenda for Data Integration(and Dat.docx
Toward a System Building Agenda for Data Integration(and Dat.docx
 
The future of data analytics
The future of data analyticsThe future of data analytics
The future of data analytics
 
Futures Thinking . Media & entertainment
Futures Thinking . Media & entertainmentFutures Thinking . Media & entertainment
Futures Thinking . Media & entertainment
 
be3: Experimenting Big Data in a Box – Part 1: Comprehending the Scenario

  • 1. be3 Big Data in a Box! Part – I : Comprehending the Landscape By Kalyana Chakravarthy Kadiyala Contact Info: Tweet – #ganaakruti Email – kckadiyala@gmail.com LinkedIn – http://www.linkedin.com/in/kadiyalakc/ insights: f(data) = ∫log(data)context;context = {0,…,∞};data={1,...,∞}
  • 2. Table of Contents 1.Disclaimer.....................................................................................................................................3 2.Foreword.......................................................................................................................................4 3.Datanomics – The Science behind Big Data..................................................................................5 4.Big Data – Technical Enumeration.................................................................................................7 4.1 How much big is Big?.............................................................................................................7 4.2 Handling the Big.....................................................................................................................8 4.3 Lego talk – Key Building blocks...............................................................................................9 4.4 Controls to Throttle...............................................................................................................10 5.Big Data – Data-driven Infotainment...........................................................................................12 5.1 Infotainment – Defining Moment..........................................................................................12 5.2 Enterprises – What do they do?............................................................................................12 5.3 Role of an Analyst.................................................................................................................13 5.4 Netizens – Ideal State...........................................................................................................14 6.Next Steps – Breaking it down!...................................................................................................15 6.1 Guide wire............................................................................................................................15 6.2 Outline..................................................................................................................................16 7.Conclusion..................................................................................................................................17 8.Keywords....................................................................................................................................17 9.Bibliography................................................................................................................................18 Illustration Index Illustration 1: Connected World – Digitized Human Interactions........................................................5 Illustration 2: Chaos Condition - Interaction Derailment....................................................................6 Illustration 3: Big Data Challenges - a visual perspective!................................................................7 Illustration 4: Focus Areas – Big Data Solution Design......................................................................9 Illustration 5: Moving Parts - Layered Approach..............................................................................10 Illustration 6: Key Throttles to control!............................................................................................10 Illustration 7: Data-Driven Infotainment..........................................................................................12 
Illustration 8: Enterprise Infotainment.............................................................................................13 Illustration 9: Connected World Experience - “All things Me”..........................................................14 insights: f(data) = ∫log(data)context;context = {0,…,∞};data={1,...,∞}
1. Disclaimer

Below are a few pointers to help set your expectations about this article. The idea is to let you form a baseline context and explore the information according to your consumption appetite.

• What is it about? It is an article about Big Data, covering aspects since its age of prominence.
• Why now? There is a lot of buzz, jargon and variants of the same tool kits, as well as solutions alike.
• How is it approached? We will identify, assert and map key domain coordinates as the first step. Subsequently, we will try to vet them using several hypotheses to understand data and platforms.
• Who is the target audience? An intrigued mind with an urge to explore the Big Data landscape, where the desired approach is to reconcile, learn and experiment. The objective of such experiments is to fill in a given data puzzle using relevant tools and techniques.
• Gotchas! While a few parts are deep technical narrations, most parts focus on functional aspects. You can skip the technical sections if you do not want to read them.
• Handling? Elucidation relies on both visual and verbal cues, primarily emphasizing the aspect of 'why' and how each stakeholder segment would view it.
• Objective – it is multi-faceted, as described below:
  ◦ Gain a comprehensive understanding of the domain
  ◦ Gather essential coordinates to navigate the landscape
  ◦ Lay the groundwork for a Big Data platform on which to run controlled data science experiments
  ◦ Build solutions with an attempt to draw unit metrics
  ◦ Once built, take our experiments to environments with larger capacities
• Why this approach? It is a validation approach as we consume different topics, tool kits and implementation concepts. This requires an approach that is agnostic to any connected environment or platform, so that we can travel beyond mere hello-world intros.
• Is this a single shot? No, the idea is to make this a sequel, so that we can refine and enrich our knowledge iteratively, taking one aspect at a time and covering it to its full extent. Again, the idea is not to be a jack of all trades; we want to get comfortable solving specific use cases where data remains the quintessential pivot point.
• Note on consumption
  ◦ Absolute no's – plagiarism is prohibited under all means and conditions.
  ◦ Questions – approach the author.
  ◦ Criticism – welcome by all means; it helps us fail early rather than fail long and late.
  ◦ Feedback – log it through channels such as tweets, email and wall posts as indicated on the first page.
  ◦ Terms & conditions – the author has adopted the GPL license; its terms and conditions apply here too.
• Last, but not least
  ◦ Thoughts and concepts presented in different parts of this articulation bear no relation to, and carry no responsibility toward, any professional associations the author is currently engaged in.
  ◦ We draw parallels to commonly seen and experienced use cases, in parts or as a whole.
  ◦ We apply our learnings – technical and functional – with an attempt to reconcile the understanding.
2. Foreword

The wave of Big Data is still riding the markets at its high peaks, sweeping across industries and geographies. Deep within the tech dungeons, the effects of euphoria still linger over many minds. Select groups had a first-hand opportunity to experience the phenomenon – for example, those who had to deal with web-scale or machine-scale data. The digital footprints we see today are the net effect of technological advances of the past decade, exploding what already existed in mainstream enterprises before then. The important aspect is not merely witnessing the evolution, but being positioned at the epicenter and experiencing the events in essence rather than by mere taste. All of this occurred long before the lexicon went viral and gained its current prominence. Others gazed, and still gaze, at it with amusement.

The current trend had its genesis at least a decade ago, during the high days of telecommunications and the Internet revolution in general. Key contributors to the innovation we see and experience today include players from verticals such as Digital Marketing, Information Search, Telecommunications and Social Media (a relative newcomer). Later supporters include those in Hi-tech and, lately, folks from Genomics and other branches of Biology. We now have several platforms to support management of these exponentially large digital footprints, with many more to come. There are tool kits that enable us to tap into the insights the footprints have to reveal. The quality and success of such toolkits and platforms are measured by how effectively they handle the noise levels in the data streams. The key is the quintessential leverage or vantage point within the context, sensitive to the time frame in which its value is relevant.

The age of prominence is probably about five years now. The reason we are drawn to it is that we seek betterment – betterment in our overall socio-economic condition as well as in how we engage in interactions. The dynamics have changed by far. It all starts with an inquisitive impulse, gradually settling in as an essential need rather than a desire. This has resulted in the emergence of a new breed of humans – Netizens – at least within urban realms, and many more are being touched as the undercurrents spread to new territories. Enterprises are driving the adoption, primarily for their own reasons, such as heeding the competition, sustaining bottom lines and accelerating top-line performance. With adoption rates on the rise, the information highways are seeking expansion in their transport and delivery capabilities, as well as in their efficiency to contain and process information as it moves between producers and consumers. Phrases such as text me, tweet me, post it to my wall, customer-360, sentiments and emotions are now part of common expression. Data nuggets, once set in motion, get transformed, trans-morphed and even consumed in varying contexts. Related technical terms include Noise, Signals, Data Science, Data Mining, Data Modeling, Data Streams, etcetera.

This article is an attempt to comprehend the present-day highlights of the Big Data landscape. As we map essential coordinates, we also set the stage for next steps that focus on gaining practical exposure. We will begin by describing the factors that drove the phenomenon so far, then try to paraphrase the subject in a more logical sense using both technical and functional aspects. Finally, we provide a perspective on how each stakeholder category is affected by the undercurrents, especially from the aspect of what it means to the bottom line, aka the socio-economic condition of each participating stakeholder. As part of this effort we use terms such as Big Data, Voids – Data & Digital, Datanomics, Socio-Economics, limitations of human cognitive capabilities, etc. Details are presented using verbal descriptions supported by visual cues, just enough to give a high-level context.

All theory and no practice is not a good way of learning either, and it is fair to ask for practical exposure. In subsequent articles we will delve into specific aspects, go beyond abstractions, pick specific use cases and try them using a Big Data Box. We will build the platform in due course. The canonical title given to this complete attempt is – "be3: Experimenting Big Data in a Box!".
3. Datanomics – The Science behind Big Data

Datanomics – gibberish as it may sound, it is the science behind all things Big Data. It is a lexicon derived from two other famous words – Data and Economics. Data, as most of us know, is a nugget of information that can describe an entity or an event, either in parts or as a whole. Economics is a social science that helps us visualize how well our economies function; it emphasizes studying the patterns associated with the production, distribution and consumption of various goods and services with some positional value. As social dwellers, we are among the entities that participate in, interact with and contribute to the overall economic functions and outcomes. A few are personal, while others are more generic and apply to the common core. With digitization, the trails of our participation are now held by machines (tools and avatars); the commonly used buzzword here is Hi-tech.

Digital footprints, as we know, are mere electronic traces locked inside several silos (specific and contextual, single versions of truth as well as snapshots). When analyzed together, these traces can reveal patterns and insights that can prove very important. One can invent and innovate potential opportunities to better our overall socio-economic function, and also foresee quintessential course corrections to avert adverse casualties.

Illustration 1: Connected World – Digitized Human Interactions (a capture, learn, apply, refine cycle across dimensions such as relevance, timing, location, outcome, language, comprehension, presentation and situation)

The preceding visual describes the connected and integrated state of our digitized lives. The interaction experiences we humans gain are very sensitive to the situations we encounter, participate in, interact with and respond to on an ongoing basis. A few have short-term effects, while others last long. Such effects can be positive, negative or neutral in their application and relevance, and the takeaways are very much contextual. For example, an enterprise organization wants better conversion rates to increase its current market share, retain customers, sustain bottom lines, achieve top-line performance, and so forth. All of these require access to insightful information, when needed and within the shortest optimal time to respond. Factors such as confidence and accuracy hinge on one critical dimension – time, or timing. Analytics is the key here. However, the changing dynamics of our socio-economic conditions demand that we adapt and innovate at a much faster rate. There are two primary reasons for this. First, pertinent facts are now generated at machine scale – much faster than a normal human cognitive brain can register and process as soon as a fact surfaces. Second, due to inherent latencies and deviations in time relevance, many of those facts end up as noise in the stream.

The success of analytics is hinged on the contextual relevance of the insights it produces, with the space and time dimensions as key variables. Converting information into insights can be challenging even when the context is kept constant. Also, when insights are served, there are varying levels of abstraction that each consumer can handle and tolerate; the important catalyst here is human cognition. A few require detailed insights, while others are okay with the gist itself. The mode of representation and communication affects the ability to grasp: information can be represented and exchanged verbally, non-verbally, visually, or as a combination of these.

That said, the field of Information Technology is going through a major shift in its course of evolution. There are now many tools and choices, for both enterprises and individuals. With changes in the underlying digital avenues, the size of digital footprints is also growing, in exponential proportions. Fragmentation can lead to higher noise levels – a common problem for both production and consumption. Simplifying this process to provide fact-based contextual insights is essential.

Analytics, without its complexity gig, starts with a few basic assertions – just enough for us to grasp the current state of affairs. Deeper needs delve further by making hypotheses, the validation of which requires factual data deep and wide enough to satisfy the ask. It is critical not to lose out on knowns already in position while the efforts focus on drawing meaning out of the unknowns; at the very least, this helps us stay away from entering a chaotic condition.

Illustration 2: Chaos Condition – Interaction Derailment

Whatever your approach may be, it is important that you do not lose the state of data nuggets by their key features and characteristics, a few critical ones being the dimensions of time, space and those that establish relevance to the environment in which they are touched. A few aspects you need to keep track of are:

• Why – this should be the first question to ask when scoping data nuggets into your activity. It helps establish a clear reason through the role they play in the overall equation, rather than something vague and ambiguous.
• What – this is the positional or contextual question that specifically declares in what sense you want to use a certain data point. We can answer it by relating the data point to terms such as expressions, impressions, presumptions, imaginations, emotions, sentiments, etcetera.
• How – once you bring a data element into the mix, this question clarifies how you are going to apply it by its relevance to the context: would you use a certain nugget as a catalyst, to complete a whole meaning, or as a substrate, etc.
• Where – this question is more about the timing or time dimension in which the data element is applied or produced. It has a direct impact on how the consuming entity perceives the value. Entities in this context are humans, machines and/or digital avatars representing humans.
Once you have slated the above aspects clearly, it is also important to check the following operational items:

• Everything is bound by timing, without losing the significance of context.
• Avoid casual approaches, as you would otherwise have to resolve many casualties through the process of exploration and insight formulation.
• Honor territorial conditions, as data must be governed for effective utilization.
• Enforce rules and policies, so to speak, on both production and consumption. You do not want to end up in a perennial cycle of babysitting someone else's desire.
• Mitigate – bottom line, we do not want to be engulfed in a situation where our biological reflexes and cognitive capabilities are degraded. This aspect is more about how you define the methods to deal with unknowns.

4. Big Data – Technical Enumeration

4.1 How much big is Big?

Big is a relative annotation; it depends on who is consuming and in what context. Something that is big for one entity may seem trivial to another. Let us check a few coordinates so that we can quantify the element of Big without losing its contextual sense and associated quality aspects. In technical terms, the list of coordinates includes Volume, Variety, Velocity and Veracity. A few choose Value in place of Veracity; since Value is more contextual, we will leave it at its best abstract meaning and purpose.

With all the data flowing through various channels, communications and exchange hubs, orchestration can become complicated. The complexity of the situation can be expressed using a geometric plane, where each aspect of the challenge is represented by a certain axis, as shown in the visual below.

Illustration 3: Big Data Challenges – a visual perspective! (axes: volume, variety, velocity, veracity; sources such as machine-generated data, digital documents, empirical sets, Internet streams, trade & commerce, academic data; activities such as acquire, cleanse, integrate, enable, secure)

Each axis on the plane represents a certain constraint on the data – Volume, Variety, Velocity and Veracity. These constraints drive the overall positional value of insights (a small profiling sketch follows this list):

• Volume – the sheer size of the data sets as manifested from their source
• Variety – the discrete forms in which data or facts can exist, either in parts or as a whole
• Velocity – the rate at which data gets generated
• Veracity – usually refers to data lineage, helpful in asserting the truth factor associated with facts
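To make these coordinates a little more tangible, here is a minimal sketch, in Python, of how a small batch of records could be profiled along the four Vs. The field names, format labels and the parse-ratio proxy for veracity are illustrative assumptions, not something prescribed by this article.

```python
import json
import time
from collections import Counter

def profile_four_vs(records, started_at):
    """Profile a batch of records along Volume, Variety, Velocity and Veracity.

    records    - list of (format_label, payload_string) tuples
    started_at - epoch seconds when collection of this batch began
    """
    elapsed = max(time.time() - started_at, 1e-6)

    volume_bytes = sum(len(payload.encode("utf-8")) for _, payload in records)
    variety = Counter(fmt for fmt, _ in records)      # discrete forms the data arrives in
    velocity = len(records) / elapsed                  # records generated per second

    # A crude veracity proxy: fraction of JSON payloads that actually parse.
    json_payloads = [p for fmt, p in records if fmt == "json"]
    parseable = 0
    for payload in json_payloads:
        try:
            json.loads(payload)
            parseable += 1
        except ValueError:
            pass
    veracity = parseable / len(json_payloads) if json_payloads else None

    return {
        "volume_bytes": volume_bytes,
        "variety": dict(variety),
        "velocity_rec_per_sec": round(velocity, 2),
        "veracity_parse_ratio": veracity,
    }

if __name__ == "__main__":
    t0 = time.time() - 2.0  # pretend the batch took two seconds to arrive
    batch = [("json", '{"user": "a", "clicks": 3}'),
             ("json", '{"user": "b", "clicks": }'),   # malformed on purpose
             ("csv", "user,clicks\nc,5")]
    print(profile_four_vs(batch, t0))
```

Veracity in particular is far richer than a parse check; lineage and source trust would normally feed into it.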
The Big Data play has three main components – Data, Infrastructure and Data Science. Data, as we know, is a collection of nuggets that describe facts and entities. The tool kits needed to support data aggregation, storage and accessibility form the underlying infrastructure. Data Science is the black magic that turns discrete nuggets into composite information insights. For a more intrigued mind, Big Data is a quest for singularity. Data nuggets are in a constant spin where they get sliced and diced until essential insights are produced. This includes the analysis of deterministic behaviors and patterns along with non-deterministic casualties; it is an amalgamation of inferential statistics and applied mathematics. Domain expertise and experience are key – they give better control over noise levels and help deliver clearer insights.

Knowing the origins of the data sources and asking the right set of questions is important. Also, assume by default that data can come from heterogeneous sources, and that it can arrive in fragments. The sources can be either internal or external to your operating realm. A few examples:

• Machine-generated data such as event logs, sensor data, usage metrics, etc.
• Socio-economic digital footprints – social media posts, feedback and other disjoint sets, etc.
• Residual data from our past consumption – for example, emails, text messages, etc.
• Disintegrated and fractured data – often caused by territorial or boundary conflicts.

As you begin to process these different nuggets through continuous exploration and constant mining, the outcome must be presented. The presentation is both visual and interactive; interactivity can be the need to toggle the available visual cues or to set different query (search or filter) criteria. Also, be aware of engineering fallacies. They manifest as either under-provisioning or over-provisioning a certain need, and the associated costs usually run in exponential terms of time and/or money. Hence, failing early is okay, rather than failing late.
4.2 Handling the Big

Is Big Data a technological phenomenon or a functional aspect identified by use case analysis? Is Big Data transactional or analytical in nature? Is it handled in batches or in real time? If it is real time, how real time are we talking, and when do you get to say it is real time? Do we get to work with snapshot data or data that is constantly in motion, aka streams? What will be the vastness of the time dimension in the context? Is it forecasting or prediction? What will it take to process such vast amounts of data? How significant is the impact on the consuming layer? How different is data security compared to conventional scenarios? These are a few of the many questions being asked by legions of technologists.

Shifting focus, let us get a sense of the available inventory – inventory that helps you build and support Big Data requirements. Key components include computational power, I/O, storage and presentation. Multi-core processors are now the norm. Besides the CPU, a few are betting on the power that a GPU (Graphical Processing Unit) can deliver. Typical laptops come packed with quad-core processors, with starter clock speeds beginning around 2.7 GHz; 8 GB of RAM with at least 1 TB of storage is quite common as well. All of this falls within an economically feasible range, and these components carry more intelligence than their predecessors. Software frameworks and associated tool kits have also gone through a major overhaul: programming languages now support constructs that reveal hooks to unlock the true potential of multi-core systems.

As a segue, it is now possible to run a decently balanced cluster of compute and storage nodes on a single hardware unit – for example, a laptop – at least enough to run controlled data science experiments as lab work. These can then be safely ported to larger operational environments with little configuration change, avoiding the cost of overlooking critical aspects that are usually hidden at higher levels of abstraction. In subsequent articles we will touch on parallel runtimes and linear scalability. Tool selection is critical to ensure an optimal balance between the key runtime expectations of a system – consistency, availability and partition tolerance – of course with the ability to deliver the best performance. While a few tools are on the bleeding edge and still evolving, most are on the cutting edge.

One aspect is very important if you indeed plan to run skunk works on mobile hardware – energy efficiency, both in powering your system and in managing heat, for as long as you intend to run experiments. There are breakers to avert fire hazards, but even before they trigger, your system should be able to freeze the compute state. What else do we need to know about Big Data? This is simple: do not get baffled. Embrace the change and be ready to fail fast instead of failing long.
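Before committing to a laptop-hosted cluster, it can help to sanity-check what the machine can actually spare. The sketch below is a hypothetical Python example that inspects core count and free memory and suggests how many small virtual nodes might fit; the per-node figures are assumptions for illustration, not recommendations from this article.

```python
import os

def suggest_node_count(cores_per_node=1, ram_per_node_gb=2.0, reserve_gb=4.0):
    """Rough sizing for a single-laptop experimental cluster.

    cores_per_node  - virtual CPUs to give each emulated node (assumed value)
    ram_per_node_gb - RAM to give each emulated node, in GB (assumed value)
    reserve_gb      - RAM kept back for the host OS and desktop (assumed value)
    """
    total_cores = os.cpu_count() or 1

    # Available physical memory via POSIX sysconf (works on Linux; other
    # platforms may need a library such as psutil instead).
    page_size = os.sysconf("SC_PAGE_SIZE")
    avail_pages = os.sysconf("SC_AVPHYS_PAGES")
    avail_gb = page_size * avail_pages / (1024 ** 3)

    by_cpu = total_cores // cores_per_node
    by_ram = int(max(avail_gb - reserve_gb, 0) // ram_per_node_gb)
    nodes = max(min(by_cpu, by_ram), 0)

    return {
        "total_cores": total_cores,
        "available_ram_gb": round(avail_gb, 1),
        "suggested_nodes": nodes,
    }

if __name__ == "__main__":
    print(suggest_node_count())
```

On the kind of baseline hardware described later (a quad-core i5/i7 with 12 GB of RAM), a check like this typically leaves room for a handful of small nodes, which lines up with the five-node ceiling in the outline.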
4.3 Lego talk – Key Building blocks

A platform that handles big data sets in a cohesive manner must be built by factoring in both the functional and technical aspects of the data. This is essential at each stage – from initial analysis through to the implementation and operational phases. The following are a few essential terms that should be understood well in this context (a small pipeline sketch follows the list):

• Real-time – this metric is based on the time and space dimensions of the context you are trying to meet objectively. Downstream validation can be vetted either in the form of presentation or as an event that drives further actions.
• Data Pipelines – these are conduits that acquire data from discrete sources into the data platform. They also support dissemination flows for data-as-a-service requirements.
• Insights Extraction – a constant cycle of mining and modeling data for insights. The goal is to enhance the richness of the insights the data can produce.
• Integrated Analytics – productive utilization of big data insights is achieved when the insights exhibit high levels of accuracy and confidence in how they represent the truth factor. This is where analytics as an interactive function is essential, achieved through visual and verbal cues.
• Semantic Inferences – the cognitive capacity of humans and their attention span are very limited. To deliver a consistent experience through insights, semantic inferences cannot be ignored when you package insights for consumption.
• Quality – as we learn, quality is a relative measure driven by consumer requirements. From a producer's perspective, the quality metric is defined by descriptors such as durability, responsiveness, timing, etc. Often it is important to project these metrics using quantifiable absolute values.
• Post-Relational – this has become a reference point for viewing the Big Data evolution from a data management standpoint, applied in a fairly lightweight sense. What it means is that the data management methods followed so far do not scale as-is to handle current volumes; scalability and performance are even more challenged and complex. Data and compute are the two facets of this coin.

Illustration 4: Focus Areas – Big Data Solution Design (key building blocks: data interactions, analytics, reports, visualization; provisioning, accessibility, DevOps, security; data sourcing, data pipelines, stream processing, insights extraction; regulations, governance, standards, enforcement)
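As a concrete, if simplified, illustration of the Data Pipelines and Insights Extraction blocks above, the following Python sketch wires a source, a cleansing step and a tiny aggregation into one flow. The record shape and the click-count "insight" are assumptions made purely for illustration.

```python
from collections import defaultdict

def source(raw_lines):
    """Acquisition: turn raw CSV-like lines into records (user, clicks)."""
    for line in raw_lines:
        user, _, clicks = line.partition(",")
        yield {"user": user.strip(), "clicks": clicks.strip()}

def cleanse(records):
    """Pipeline stage: drop records whose click count is not a number."""
    for rec in records:
        if rec["clicks"].isdigit():
            rec["clicks"] = int(rec["clicks"])
            yield rec

def extract_insight(records):
    """Insights extraction: total clicks per user, an at-rest consumable model."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["user"]] += rec["clicks"]
    return dict(totals)

if __name__ == "__main__":
    raw = ["alice, 3", "bob, oops", "alice, 2", "carol, 7"]
    insight = extract_insight(cleanse(source(raw)))
    print(insight)   # {'alice': 5, 'carol': 7}
```

In a real platform these stages would be separate processes fed by channels such as message queues, but the shape of the flow – acquire, cleanse, derive – stays the same.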
4.4 Controls to Throttle

As touched on in previous sections, several discrete parts need to be brought together to form a cohesive whole. It is also recommended to presume that these parts can be set in motion independently of each other, and that they can fail without any precursors. Once glued together, scaling them becomes a fine art that requires a detailed eye. The architectural constraint applied in the industry is the CAP theorem (Consistency, Availability and Partition Tolerance). One step down, it is important to focus on concurrency, throughput and failover.

Illustration 5: Moving Parts – Layered Approach (hardware resources, physical or emulated; infrastructure such as Hadoop, JVM, Unix; application concerns such as heap, data models, compute)
Illustration 6: Key Throttles to control! (concurrency, throughput, failover)

One thing to clarify about the above visuals: terms such as Java, Hadoop, Unix, etc. are used only to set a contextual reference; any other tool with similar or better capability could take their place. Tying these topics back into the discussion of data streams and data science, we see a Big Data engine at full throttle. Efficiency is a measure of how well the discrete pieces align with their peers, such that you see a fluid, pass-through straight line of digital highways.
Skipping the gory details of implementation, let us assume you now have a system that is fully provisioned from an infrastructure standpoint and is operational. In such circumstances, how do you monitor and manage the system and ensure five-nines productivity and efficiency? What does it even mean to ask for five nines? We are asking for consistent performance, keeping computational efficiency as a constant time factor. Other influential variables include data that moves in the space and time dimensions. Your try-catch blocks (exception or error handling) can only cover deterministic behaviors that account for known issues; taming systems when they exhibit non-determinism is a daunting task.

The preceding illustration presents a set of least common denominators, aka controls, that you can throttle to monitor and manage your operational environments. The goal is to balance the operational constraints of Consistency, Availability and Partition Tolerance (CAP); a small failover sketch follows this list.

• Consistency – guarantee consistent usability, even if underlying parameters deviate and vary.
• Availability – system services are always available on demand, by a constant SLA factor.
• Partition Tolerance – data and compute capacities are distributed across the various nodes in the cluster that form the platform. Partition tolerance is about dealing with inherent fragmentation or dropped communications while still providing a complete experience to the entities interacting with the system. This affects primarily performance, and then results in a dwindling state of consistency and availability.
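The failover throttle in particular is easy to reason about with a small example. Below is a minimal, hypothetical Python sketch of a retry-with-backoff wrapper of the kind one might place around a flaky node call; the retry counts and delays are illustrative assumptions, not tuned values.

```python
import time
import random

def call_with_failover(operation, attempts=3, base_delay=0.5):
    """Retry a flaky operation a few times, backing off between attempts.

    operation  - zero-argument callable that may raise on failure
    attempts   - maximum number of tries before giving up (assumed value)
    base_delay - initial sleep in seconds, doubled after each failure (assumed value)
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:            # deterministic handling of known failure modes
            if attempt == attempts:
                raise                        # non-determinism beyond this point is the hard part
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            time.sleep(delay)
            delay *= 2

def flaky_node_query():
    """Stand-in for a call to a cluster node that fails part of the time."""
    if random.random() < 0.6:
        raise ConnectionError("node did not respond")
    return {"rows": 42}

if __name__ == "__main__":
    print(call_with_failover(flaky_node_query))
```

Concurrency and throughput throttles would sit alongside this, for example as a bound on in-flight requests and a cap on batch size.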
5. Big Data – Data-driven Infotainment

5.1 Infotainment – Defining Moment

Paraphrasing the discussion so far, we need the tooling to deliver insightful, contextually relevant information – primarily with the ability to access it on demand, derived from quantifiable metrics and qualitative inferences. Anything else will be garbaged out easily, without even a single glance. Once an effective baseline is established, there will be at least three broad categories of audience drawn toward it – Enterprises, Analysts and Consumers (aka Users). Their requirements will be very discrete. Zoom in close and you can also sense the randomized patterns that each of their requirements and levels of abstraction carry: everyone from a naive brain to a highly analytical, mature one will be glancing at the same piece of information.

Illustration 7: Data-Driven Infotainment

The preceding diagram sets a visual context of data flows as a perennial stream. The participating entities produce and exchange information in more than one form – verbal, non-verbal, audio and video. Along the way the data gets transformed, and even trans-morphed, to suit different needs and contexts. An ideal infotainment situation is one where unwanted information gets discarded securely and with some sense of responsibility.

5.2 Enterprises – What do they do?

Let us switch context to more specific examples. Moving in the order of enumeration, Enterprises (for-profit or non-profit) consider the infotainment medium a power boost. It is an essential tool in their tool chest that can unlock potential capabilities they can further explore, if not totally capitalize on. There are now business models that are purely data-driven; data is their new currency of trade and commerce, used either to improve top-line performance or to sustain the bottom line. The field of Business Intelligence (where information is constantly augmented to produce actionable insights) is now at least two decades into its path toward maturity. From a passive, referential reporting or summed-up dashboard experience, it has moved into a very dynamic field of study – one that can reflect the current state of affairs with more agility, accuracy and confidence. As the changes in the underlying fabric become more apparent, the rules of the trade (standards and governance rules) and the scope of their enforcement are also changing. The number of stakeholders and data stewards is now vastly larger and more diverse. As humans, we are still fond of instantaneous results. We capture the perspective of an Enterprise (for-profit or non-profit) in the following visual representation:

Illustration 8: Enterprise Infotainment

5.3 Role of an Analyst

Let us move to the next category of users – the Analysts. On the stage of infotainment, this group's role is primarily to dissect the state of affairs. Rational thinking and negation are key characteristics of this group. Their prime focus is on asking fundamental questions such as why, what, where, when and how. Their modus operandi can be independent or biased, largely influenced by the level of association and degree of affiliation to any particular enterprise or organization. Their core strength lies in their ability to reflect the current state of affairs as a single truth statement; their job is to provide a clear picture of the undercurrents within the socio-economic fabric. This group's information requirements come at a variety of abstraction levels. Use of information is subject to standards, compliance and rules enforced by their controlling authority, and once they travel beyond their confinements, the degree of enforcement gets even stricter. Unauthorized application or misrepresentation of facts will only backfire – instantaneously, decisively and intensely. They need to be supported with these requirements as well, besides the accessibility and availability of the required data. Once again, time and space will be your challenges.
5.4 Netizens – Ideal State

Moving down the path is our penultimate and critical actor on the stage of infotainment – the Netizen! As powerful as they may sound, they can be just as vulnerable; they are the most exposed entity in the whole ecosystem. Compliance and social responsibility usually sit at the very end of their priority list. They assume several aspects, ranging from availability to security. Pause a while and observe carefully: they are like inventory to a service or a product. Their participation is driven merely by getting intrigued, by socio-economic interests, or by an act of goodwill and assumed trust.

This group consumes and produces information in frames. Think of a moment in a reel of film where you have subjects and some context – something subject to change with the next scene on the roll. Typical challenges faced by this group include fragmentation of information and communications. Information loses its relevance or value if it is not provided with the right timing; this holds in both scenarios – when it is being asked for, and when it can be provided because of their subscriptions and preferences. Instead of tuning into multiple channels and platforms, can this group be provided a single facade or portal to get a snapshot of their overall socio-economic condition? Would that help? The following illustration provides a visual perspective of this need:

Illustration 9: Connected World Experience – "All things Me" (point-in-time interactions across society, relations, culture, career, finance and health, fed by sources such as infotainment media, educational systems, personalized digital assets, non-personal digital assets and non-digital assets)

You may ask, what are my benefits? It will save you time in learning about socio-economics, or at least about how you are faring on that scene. In economic terms, it will optimize your use of standard communication platforms, and maybe save a few extra pennies otherwise thrown away as access costs for Internet on the go. We are seeing API-based approaches in many places; it may be worth exploring.
6. Next Steps – Breaking it down!

6.1 Guide wire

Subsequent efforts will focus on setting up an experimental big data platform. The platform will be built on a decently powered laptop. The baseline hardware specifications are a processor comparable to an Intel i5/i7 quad core, a minimum of 12 GB of DDR3L 1600 MHz SDRAM, and a 7200 RPM HDD with 128 GB to spare for cluster storage. A few laptops come with lower HDD speeds of 5400 RPM; for example, the cluster being built will use an ASUS ROG 750JW gaming laptop, which comes packed with all of the above except that its HDD spins at 5400 RPM. If you like, you can add another drive in the spare socket that clocks higher. Here is the link for more details on the ASUS one!

Further, we will leverage the platform to assert the following hypothesis against different use cases and data streams. It is an attempt to represent the data challenge in mathematical terms:

insights = f(data) = ∫ log(data) d(context); context = {0, …, ∞}; data = {1, …, ∞}

The interpretation of the above statement is as follows (a toy numeric reading follows the list):

• Variables
  ◦ Data – an information nugget that describes part or the whole of an object, event or action.
  ◦ Context – representative of a situation or environment setting that gives a more time-based, relative sense of an object, event or action, and of how it is perceived by participating entities. In some sense, context is not different from views on top of database tables, except that here the views are non-materialized and dynamically generated for that moment of consumption.
• Value Range
  ◦ The number of data points must be at least one to comprehend its meaning or purpose and apply it effectively. Since the outbound value is not determined and is usually driven by the context, it is represented here as extending to infinity (∞).
  ◦ Driven by the producing or consuming entities, there can be zero or more contexts. For example, a process responsible only for acquiring data and transporting it to some target system may not need to check the inherent contexts the data set is capable of projecting. On the other hand, a consuming entity accessing a certain data set may be interested in querying by context, to help its information requirements. Above all, a producing entity can also emit data by context so that downstream consumers can use the output effectively, with little or no processing overhead. Hence the range of context is expressed between zero and infinity (∞).
  ◦ In both cases you may ask why the starting value in the range is zero: there cannot be a negative indicator, aka NaN, if you want to make sense of either the context or a data point.
• Key Catalyst
  ◦ Any insight is considered relevant only if it can be aligned to the nature of the producers and consumers of the underlying data set or information nuggets. Hence, as part of asserting the above hypothesis, we will add a key catalyst to the variable mix – Actors (participating entities: producers and consumers).
  ◦ If you recall section 5 above, the very first generalized visual sets the context of insights being consumed in frames – frames that are relevant to consumers. The role of these Actors will be vetted against the very requirements each frame in the process expects.
  ◦ The effect of this catalyst is usually exponential. We say 'usually' because most parts can be addressed using commonality and granularity adjustments.
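Purely as an illustration of how the value ranges above could be exercised, here is a toy, discrete reading of the hypothesis in Python: contexts are enumerated rather than integrated over, and the "insight value" of a set of data nuggets is simply summed as log(data) across the contexts in which they are used. This reading is an assumption made for the sketch, not a formal claim of this article.

```python
import math

def insight_value(data_points, contexts):
    """Toy, discrete stand-in for insights = f(data) = ∫ log(data) d(context).

    data_points - values >= 1 (the hypothesis constrains data to {1, ..., ∞})
    contexts    - zero or more context labels (constrained to {0, ..., ∞})
    """
    if any(d < 1 for d in data_points):
        raise ValueError("data points must be >= 1")
    if not contexts:
        return 0.0   # with zero contexts, no insight value accrues
    # Sum log(data) once per context in which the data is applied.
    return sum(math.log(d) for d in data_points) * len(contexts)

if __name__ == "__main__":
    clicks = [3, 5, 2]                                      # hypothetical data nuggets
    print(insight_value(clicks, []))                        # 0.0 - data without context
    print(insight_value(clicks, ["marketing"]))             # ~3.4
    print(insight_value(clicks, ["marketing", "support"]))  # doubles with a second context
```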
6.2 Outline

The following are the specifics of how we will pursue this domain and its subject constituents further (a small node-emulation sketch follows the list):

• Assemble a mobile Big Data platform on a decently powered laptop – a cluster of 5 nodes at most!
• Data aggregation:
  ◦ Build data ingestion pipelines for data coming in different forms and representation formats
    ▪ Channels to pour data into the cluster
    ▪ Computation modules to slice and dice such data into immediately consumable at-rest models
  ◦ Simulate scenarios that fail the critical modules
  ◦ Trace the steps to achieve full system and data recovery
• Analytics
  ◦ Data mining with the objective of deriving all probable contexts from a given data set
  ◦ Take such a data set, for one or more associated contexts, and generate insights
  ◦ Project such insights to meet a user or machine requirement objectively
• Presentation (visual and application)
  ◦ Simulate scenarios that consume the projected insights
  ◦ Identify and gather learnings from such presentations
  ◦ Recycle them back into the core for further refinement and source data enrichment
• Through and through,
  ◦ Understand the relevance of tools, given a problem context
  ◦ Draw parallels between one approach and another – technical and functional
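Until the actual cluster is assembled, the "cluster of 5 nodes at most" can be crudely emulated on one machine. The sketch below is a hypothetical Python example that spins up a few worker processes as stand-in nodes and scatters a word-count style task across them; it is only meant to convey the shape of the experiment, not the eventual tooling.

```python
from multiprocessing import Pool
from collections import Counter

NODES = 5  # upper bound from the outline; fewer may fit on a given laptop

def node_task(lines):
    """Work done on one emulated node: count words in its shard of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def scatter(lines, nodes):
    """Split the input into one shard per emulated node."""
    return [lines[i::nodes] for i in range(nodes)]

if __name__ == "__main__":
    corpus = ["big data in a box",
              "data science on a laptop",
              "a controlled big data experiment"]
    shards = scatter(corpus, NODES)
    with Pool(NODES) as pool:
        partials = pool.map(node_task, shards)   # each shard handled by a worker process
    total = sum(partials, Counter())             # gather step: merge partial counts
    print(total.most_common(3))
```

The scatter, compute and merge steps mirror, in miniature, what the ingestion and analytics stages above would do across real nodes.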
7. Conclusion

Big Data, which started as marketing buzz, has now settled into more practical channels of application. In a retrospective approach, we have tried to get our arms around the concept of Big Data. That involved attempts to learn about the generational shifts, the manifestation sources, the general applicability and the practical relevance to the different categories of entities that exist in our socio-economic landscape. We also touched on a few technical aspects such as platform architecture, degree of complexity, challenges, etc. As mentioned at the very beginning of this article, we want to experiment with and experience the phenomenon. We will cover the practical aspects in a sequel article – be3: Controlled Big Data Experimentation.

8. Keywords

We live in the age of search. We depend on search tools to ask questions, and we are okay with even the slightest clue about what we are looking for. This section provides a vocabulary of words and phrases that will help you gain the context quickly and easily.

• Big Data • Contextual Relevance • Human Cognition • Relative Sense • Time or Timing • Confidence • Accuracy • Hadoop • High Performance Computing • Data and Economics • Infotainment • Datanomics • Data Science • Mathematics • Relational vs Post-relational • Meta-data • Information Insights • Data Nuggets • Netizen • Controlled Data Experiments • Patterns • Noise • Signals • Data Abstraction • Governance • CAP • Standards Compliance • Semantic Inferences • Amplification • Insights with clarity
9. Bibliography

This section serves as a bibliography, with links to the various Internet sources tapped to acquire background knowledge on the topics of Big Data, Hadoop, Linux and High Performance Computing. Most of the known articles are referenced directly, along with an untracked list of forum posts on sites such as Stack Overflow, OSDIR, Google Forums, etc. While some portions of this article provide links to external references, the following is the list of all known and tracked resources from the Internet that were used to further the understanding and refine the grasp.

• Contextual Computing: Our Sixth, Seventh and Eighth Senses – http://www.forbes.com/sites/reuvencohen/2013/10/18/contextual-computing-our-sixth-seventh-and-eighth-senses/
• Context – http://en.wikipedia.org/wiki/Context
• Economics – http://en.wikipedia.org/wiki/Economics
• Optimality Theory – http://en.wikipedia.org/wiki/Optimality_Theory
• Open Sans Font (Apache License) – http://cooltext.com/Download-Font-Open+Sans
• Oracle VirtualBox Documentation – https://www.virtualbox.org/manual/UserManual.html
• Consistency Types – http://en.wikipedia.org/wiki/Consistency_model#Types
• Big Data Interest is Soaring, but Adoption Rates are Stalling – http://www.hightech-highway.com/communicate/big-data-interest-is-soaring-but-adoption-rates-are-stalling/
• Is iOS7 A Better Innovation Platform than Android? – http://www.forbes.com/sites/haydnshaughnessy/2013/06/19/is-ios-7-a-better-innovation-platform-than-android/
• Manifold – http://en.wikipedia.org/wiki/Manifold
• Technological Singularity – http://en.wikipedia.org/wiki/Technological_singularity
• Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services – http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
• w3.org: Semantic Web – Inference – http://www.w3.org/standards/semanticweb/inference