Tools and techniques adopted for Big Data analytics
Dept. of Industrial Engineering and Management, Bangalore Institute of Technology
INDEX
- OVERVIEW
- INTRODUCTION
- BIG DATA ANALYSIS PIPELINE
  - Data Acquisition and Recording
  - Information Extraction and Cleaning
  - Data Integration, Aggregation, and Representation
  - Query Processing, Data Modelling, and Analysis
  - Interpretation
- FIELDS OF RELEVANCE
- TOOLS AND TECHNIQUES IN DATA ANALYTICS: AN OVERVIEW
  - A/B testing
  - Crowdsourcing
  - Machine learning
- CASE STUDIES
  - Shoppers Stop
  - Airbnb
  - Indian Elections
- 15 INDIAN BIG DATA COMPANIES TO WATCH OUT FOR
1) Overview
“Information is the oil and data analytics is the combustion engine”- Peter Sondergaard
Since the invention of the World Wide Web, information has been made accessible at a
minute level, which has in turn created a great deal of unstructured data. Pioneers in technology
and mathematics predicted this boom in information well before modern times: the ever-expanding
body of scientific knowledge, the maintenance of census records, and the sheer volume of journals
and publications that had to be stored all pointed to the same conclusion. The rate at which
information is expanding today owes much to a society connected by the internet and mobile phones.
This report is a glance into the tools and techniques adopted by organizations today to
arrive at optimized solutions to various problems, to decide which functions to incorporate into
products, and to make a value-added proposition of each decision taken. The initial effort is to
understand the functional aspects of big data; the report then surveys the upcoming tools
as a whole and focuses on three major concepts among them.
The study then analyses three cases of different magnitude, scope and location to
understand the application of big data at a fundamental and practical level. The study
focuses on two Indian cases and one international case.
The study also aims at understanding upcoming trends in the analytics environment by
profiling 15 Indian startups built solely on data science.
2) Introduction:
The term “Big Data,” which spans computer science and statistics/econometrics,
probably originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid
1990s, in which John Mashey figured prominently. The first significant academic references
are arguably Weiss and Indurkhya (1998) in computer science and Diebold (2000) in
statistics/econometrics. An unpublished 2001 research note by Douglas Laney at Gartner
enriched the concept significantly. Hence the term “Big Data” appears reasonably attributed to
Mashey, Weiss and Indurkhya, Diebold, and Laney. Big Data the phenomenon continues
unabated, and Big Data the discipline is emerging.
Recent technological advances and novel applications, such as sensors, cyber-physical
systems, smart mobile devices, cloud systems, data analytics, and social networks, are making
it possible to capture, process, and share huge amounts of data – referred to as big data – and to
extract useful knowledge from it, such as patterns, and to predict trends and events. Big
data is making possible tasks that were previously impossible, such as preventing the spread of
disease and crime, personalizing healthcare, and quickly identifying business opportunities.
2.1) Definition:
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed time.
Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes
to many petabytes of data. Big data is a set of techniques and technologies that require new
forms of integration to uncover large hidden values from large datasets that are diverse,
complex, and of a massive scale.
Big data can be described by the following characteristics:
Volume – The quantity of data that is generated is very important in this context. It is the size
of the data which determines the value and potential of the data under consideration and
whether it can actually be considered as Big Data or not. The name ‘Big Data’ itself contains
a term which is related to size and hence the characteristic.
Variety - The next aspect of Big Data is its variety. The category to which the data belongs is
an essential fact that data analysts need to know; it helps those who closely analyze the data,
and are associated with it, to use it effectively to their advantage, thus upholding the
importance of Big Data.
Velocity - The term ‘velocity’ in the context refers to the speed of generation of data or how
fast the data is generated and processed to meet the demands and the challenges which lie
ahead in the path of growth and development.
Variability - This is a factor which can be a problem for those who analyze the data. This refers
to the inconsistency which can be shown by the data at times, thus hampering the process of
being able to handle and manage the data effectively.
Complexity - Data management can become a very complex process, especially when large
volumes of data come from multiple sources. These data need to be linked, connected and
correlated in order to grasp the information they are meant to convey.
3) Big Data analysis pipeline:
3.1) Phases in the Processing Pipeline:
3.1.1) Data Acquisition and Recording
Big Data does not arise out of a vacuum: it is recorded from some data generating
source. For example, consider our ability to sense and observe the world around us, from the
heart rate of an elderly citizen, and presence of toxins in the air we breathe, to the planned
Square Kilometre Array telescope, which will produce up to 1 million terabytes of raw data per
day. Similarly, scientific experiments and simulations can easily produce petabytes of data
today.
Much of this data is of no interest, and it can be filtered and compressed by orders of
magnitude. One challenge is to define these filters in such a way that they do not discard useful
information. We need research in the science of data reduction that can intelligently process
this raw data to a size that its users can handle while not missing the needle in the haystack.
Furthermore, we require “on-line” analysis techniques that can process such streaming data on
the fly, since we cannot afford to store first and reduce afterward.
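The "on-line" reduction described above can be made concrete with a short sketch. Below is a minimal Python example, with illustrative threshold and sample-size values, that filters a sensor stream on the fly and keeps a fixed-size uniform sample via reservoir sampling, so the raw stream never has to be stored in full.

```python
import random

def reduce_stream(readings, threshold=0.0, sample_size=1000):
    """Filter a stream on the fly and keep a uniform random sample
    of at most `sample_size` of the readings that pass the filter."""
    reservoir = []
    seen = 0
    for value in readings:
        if value <= threshold:          # domain-specific filter: drop noise
            continue
        seen += 1
        if len(reservoir) < sample_size:
            reservoir.append(value)
        else:
            # replace an existing element with decreasing probability,
            # keeping the sample uniform over everything seen so far
            j = random.randrange(seen)
            if j < sample_size:
                reservoir[j] = value
    return reservoir
```

Because the reservoir is bounded, memory stays constant no matter how long the stream runs; the open problem the text points to is choosing the filter so that the needle in the haystack is not discarded.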
The second big challenge is to automatically generate the right metadata to describe
what data is recorded and how it is recorded and measured. In scientific experiments, for
example, considerable detail regarding specific experimental conditions and procedures may be
required in order to interpret the results correctly.
3.1.2) Information Extraction and Cleaning
Frequently, the information collected will not be in a format ready for analysis.
For example, consider the collection of electronic health records in a hospital, comprising
transcribed dictations from several physicians, structured data from sensors and measurements
(possibly with some associated uncertainty), and image data such as x-rays. We cannot leave
the data in this form and still effectively analyse it. Rather we require an information extraction
process that pulls out the required information from the underlying sources and expresses it in
a structured form suitable for analysis.
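As a minimal sketch of such an extraction process, the following Python snippet pulls (name, value, unit) triples out of a free-text note; the pattern, the sample note and the field names are illustrative assumptions, not a clinical standard.

```python
import re

NOTE = "BP 120/80 mmHg, pulse 72 bpm, temp 98.6 F, patient stable."

# name, numeric value, optional unit, e.g. "pulse 72 bpm"
PATTERN = re.compile(r"(?P<name>[A-Za-z]+)\s+(?P<value>[\d./]+)\s*(?P<unit>[A-Za-z/]*)")

def extract(note):
    """Turn unstructured text into structured records ready for analysis."""
    return [m.groupdict() for m in PATTERN.finditer(note)]

print(extract(NOTE))
# [{'name': 'BP', 'value': '120/80', 'unit': 'mmHg'}, ...]
```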
3.1.3) Data Integration, Aggregation, and Representation
Data can be very heterogeneous and may have different metadata. Data integration,
even in more conventional cases, requires huge human efforts. Novel approaches that can
improve the automation of data integration are critical as manual approaches will not scale to
what is required for big data. Different data aggregation and representation strategies may
also be needed for different data analysis tasks.
Even for simpler analyses that depend on only one data set, there remains an
important question of suitable database design. Usually, there will be many alternative ways
in which to store the same information. Certain designs will have advantages over others for
certain purposes, and possibly drawbacks for other purposes.
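A minimal sketch of the integration step, assuming two small sources that describe the same entities under different schemas (the column names are made up for illustration):

```python
import pandas as pd

# two heterogeneous sources describing the same patients
sensors = pd.DataFrame({"patient_id": [1, 2], "heart_rate": [71, 88]})
records = pd.DataFrame({"pid": [1, 2], "diagnosis": ["stable", "observe"]})

# unify the key names, then merge into one representation for analysis
records = records.rename(columns={"pid": "patient_id"})
combined = sensors.merge(records, on="patient_id", how="outer")
print(combined)
```

At big data scale this renaming and matching cannot be done by hand across thousands of columns, which is exactly why the text calls for automated integration.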
3.1.4) Query Processing, Data Modelling, and Analysis
Methods suitable for big data need to be able to deal with noisy, dynamic,
heterogeneous, untrustworthy data, and data characterized by complex relations. However,
despite these difficulties, big data, even if noisy and uncertain, can be more valuable for
identifying reliable hidden patterns and knowledge than tiny samples of good
data. The (often redundant) relationships existing among data can also represent an opportunity
for cross-checking data and thus improving data trustworthiness. Supporting query processing
and data analysis requires scalable mining algorithms and powerful computing infrastructures.
A problem with current Big Data analysis is the lack of coordination between
database systems, which host the data and provide SQL querying, and analytics packages that
perform various forms of non-SQL processing, such as data mining and statistical analyses.
Today’s analysts are impeded by a tedious process of exporting data from the database,
performing a non-SQL process and bringing the data back.
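The round trip looks roughly like the following sketch, assuming a hypothetical SQLite database with a `sales` table: SQL does the selection, a Python package does the non-SQL statistics, and the result is written back.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")   # hypothetical database file

# export: pull the relevant slice out of the database with SQL
df = pd.read_sql_query("SELECT region, revenue FROM sales", conn)

# non-SQL processing: statistics the database does not provide natively
summary = df.groupby("region")["revenue"].agg(["mean", "std"])

# bring the data back so other SQL users can query the result
summary.to_sql("revenue_summary", conn, if_exists="replace")
```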
3.1.5) Interpretation
Having the ability to analyze Big Data is of limited value if users cannot understand
the analysis. Ultimately, a decision-maker, provided with the result of analysis, has to interpret
these results. It is rarely enough to provide just the results. Rather, one must provide
supplementary information that explains how each result was derived, and based upon
precisely what inputs. Such supplementary information is called the provenance of the (result)
data.
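As a minimal illustration of provenance (the structure here is an assumption, not a standard), each derived result can carry the operation and inputs that produced it:

```python
def with_provenance(value, operation, inputs):
    """Bundle a result with a record of how it was derived."""
    return {"value": value, "operation": operation, "inputs": inputs}

raw = [10, 12, 11, 13]
avg = with_provenance(sum(raw) / len(raw), "mean", {"raw_readings": raw})
print(avg)   # a decision-maker can drill down from the result to its inputs
```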
Visualizations become important in conveying to the users the results of the queries in
a way that is best understood in the particular domain. Whereas early business intelligence
systems’ users were content with tabular presentations, today’s analysts need to pack and
present results in powerful visualizations that assist interpretation, and support user
collaboration. Furthermore, with a few clicks the user should be able to drill down into each
piece of data that she sees and understand its provenance, which is a key feature to
understanding the data. That is, users need to be able to see not just the results, but also
understand why they are seeing those results.
4) Fields of relevance
Big data is relevant for all components of our society. Industry is using big data
for shifting business intelligence from reporting and decision support to prediction and next-
move decisions. This use of big data emphasizes that big data is critical for obtaining actionable
knowledge. Governments are also interested in using big data and predictive analytics to
improve decision making and transparency, to engage citizens in public affairs, and to improve
national security. Healthcare represents another major area to which big data may offer novel
opportunities. Learning health systems are currently focusing on turning health care data into
knowledge, translating that knowledge into practice, and creating new data by means of
advanced information technology. As has been pointed out, the use of big data technologies can
reduce the cost of healthcare while improving its quality by making care more preventive and
personalized and basing it on more extensive (home-based) continuous monitoring.
Big data is also crucial for research. Many areas of science and engineering are
currently facing from a hundred to a thousand-fold increase in the volume of data generated
compared to only one decade ago. This data is produced by many sources including
simulations, high-throughput scientific instruments, satellites, and telescopes. While the
availability of big data is revolutionizing how research is conducted and is leading to the
emergence of a new paradigm of science based on data-intensive computing, at the same time
it poses a significant challenge for scientists. In order to be able to leverage these huge volumes
of data, new techniques and technologies are needed. A new type of e-infrastructure, the
Research Data Infrastructure, must be designed, implemented and optimized to support the full
life cycle of scientific data, its movement across scientific disciplines, and its integration with
published literature.
Fig 2. Infographic showing current and developing state of big data.
5) Tools and techniques: an overview
The main focus of this report is to study the various tools, techniques and
technologies that organizations around the world have adopted to reduce this
monumental mass of data from its unstructured state to a quantifiable, structured and retrievable
state. The techniques provide a fundamental insight into the basics of retrieving useful data
from big databases, or big data as it is referred to.
Tools and Techniques
A/B testing
Crowdsourcing
Data fusion and integration
Genetic algorithms
Machine learning
Natural language processing
Signal processing, simulation
Time series analysis
Visualization
Data mining
Association rule learning
Classification tree analysis
Regression analysis
Sentiment analysis
Social network analysis
5.1) A/B testing
A/B testing is a form of statistical hypothesis testing with two variants, which leads to the
technical term two-sample hypothesis testing used in the field of statistics. Other terms for
this method include bucket tests and split testing, but those terms have a wider applicability,
to more than two variants.
A/B testing, also known as split testing, is a method of testing through which
marketing variables are compared to each other to identify the one that brings a better response
rate. In this context, the element that is being tested is called the “control” and the element
that is argued to give a better result is called the “treatment.” Running A/B tests in your marketing
initiatives is a great way to learn how to drive more traffic to your website and generate more
leads from the visits you’re getting. Just a few small tweaks to a landing page, email or call-
to-action can significantly affect the number of leads your company attracts. The insights
stemming from split tests can drastically improve the conversion rates of your landing pages
and the clickthrough rates of your website calls-to-action and email campaigns. In fact, A/B
testing of landing pages can generate up to 30-40% more leads for B2B sites and 20-25% more
leads for eCommerce sites.
Fig 3. Analysis of variation in A/B testing
The statistical aspects behind A/B testing
Factor - A controllable experimental variable that is thought to influence the OEC (Overall
Evaluation Criterion). Factors are assigned values, sometimes called levels or versions, and are
themselves sometimes called variables. In a simple A/B test, there is a single factor with two
values: A and B.
Variant - A user experience being tested by assigning levels to the factors; it is either the
Control or one of the Treatments. Sometimes referred to as Treatment, although we prefer to
specifically differentiate between the Control, which is a special variant that designates the
existing version being compared against and the new Treatments being tried. In case of a bug,
for example, the experiment is aborted and all users should see the Control variant.
Experimentation Unit - The entity on which observations are made. Sometimes called an item.
The units are assumed to be independent. On the web, the user is the most common
experimentation unit, although some experiments may be done on sessions or page views. For
the rest of the paper, we will assume that the experimentation unit is a user. It is important that
the user receive a consistent experience throughout the experiment, and this is commonly
achieved through cookies.
Null Hypothesis - The hypothesis, often referred to as H0, that the OECs for the variants are
not different and that any observed differences during the experiment are due to random
fluctuations.
Confidence level - The probability of failing to reject (i.e., retaining) the null hypothesis
when it is true.
Power - The probability of correctly rejecting the null hypothesis, H0, when it is false.
Power measures our ability to detect a difference when one indeed exists.
A/A Test - Sometimes called a Null Test. Instead of an A/B test, you exercise the
experimentation system by assigning users to one of two groups but exposing them to exactly
the same experience. An A/A test can be used to (i) collect data and assess its variability for
power calculations, and (ii) test the experimentation system (the null hypothesis should be
rejected about 5% of the time when a 95% confidence level is used).
Standard Deviation (Std-Dev) - A measure of variability, typically denoted by σ.
Standard Error (Std-Err) - For a statistic, the standard deviation of the sampling
distribution of the sample statistic. For a mean of n independent observations, it is σ/√n,
where σ is the estimated standard deviation.
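Putting these definitions together, here is a minimal sketch of the two-sample test behind an A/B experiment. It assumes the OEC is a conversion rate and uses illustrative counts; only the pooled standard error and a normal approximation are involved.

```python
from math import sqrt, erf

def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-sample z-test on conversion rates for Control (A) vs Treatment (B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # pooled standard error of the difference in rates
    p = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = ab_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")   # reject H0 at 95% confidence if p < 0.05
```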
On an e-commerce website you can test:
Content: headlines, texts, product descriptions, testimonials, etc. I strongly believe that words
can make a huge difference for any business. Communicating the right message, in the right
way, to the right audience will boost conversions instantly.
Images and videos. Sometimes people tend to skip the text and simply look at the pictures on
your website. Always display high-quality pictures related to the topics on the site.
Call-to-action buttons.
Design. I include here fonts, colours, position of elements on the page, etc. The whole website
must have one design that matches the brand’s identity. Use matching colours, and always check
the meaning of every colour, because colour has a huge impact on visitors.
Benefits of A/B testing
A/B testing comes in handy because it lowers the risk of important decisions in
the company. Doing A/B testing constantly will point out what to do and what not to do on your
website, and you will know what decision to make.
With A/B testing, failure is not an option. I say this because you have nothing to lose in an
A/B testing experiment. Even if the test hasn’t reached statistical relevance, or the results are
not what you expected, there is no financial loss involved.
It is cheaper to use A/B testing than to directly modify your website. If you modify your
website without testing first, you invest a lot of money and time in programming and design,
and nothing can tell you whether the money you spend will come back to you as profit. But if
you test the variations and realize it is not worth making those changes, you save time and
money.
Some online tools that help in A/B testing are Google Analytics Content Experiments,
Optimizely, Unbounce, Wingify, Genetify, Five Second Test, etc.
5.2) Crowdsourcing
Crowdsourcing represents the act of a company or institution taking a function once
performed by employees and outsourcing it to an undefined (and generally large) network of
people in the form of an open call. This can take the form of peer-production (when the job is
performed collaboratively), but is also often undertaken by sole individuals. The crucial
prerequisite is the use of the open call format and the large network of potential labourers.
Fig 4. Systematic data flow in crowdsourcing analytics
From avoiding traffic jams, to analysing pedestrian flow patterns, to finding the best
public toilet in town, crowdsourcing apps are showing that many smartphones make for light
work.
With thousands of mini-reports coming in from around the internet, a mosaic of
information can form a larger picture that can be used for many different purposes, from
meteorology to car-sharing.
Using the intelligence of a vast interconnected organism, however, is nothing new: the
venerable Oxford English Dictionary may in fact be the earliest example of crowdsourcing. In
the mid-19th century it made an open call for volunteers to log words and provide examples of
their usage. Over a 70-year period, it received more than six million submissions. Today,
crowdsourcing is used in investing, in creative work and in funding start-up projects.
Crowdsourcing sites represent just one more type of network that will connect
people with products and technology, telling you what products they used, what they thought
of them, and what reviews they read, liked or shared. If you then link the crowdsourcing
network to a business network like LinkedIn, you can connect companies to reviewers and
bring with it lots of context that when comprehensively analysed can transform your
understanding of the reviews:
Analyze the reviews for opinion. What companies are using what products, what they think of
them, and why?
Analyze the interactions for need and intent. If someone read lots of reviews about CRM
systems, there is a reasonable chance they may need a CRM system. If they then share or
recommend a particular review, that may indicate intent to buy or intent to investigate further.
It may also indicate an attempt to sell, however this is easy to catch from the business network.
Analyze the business network for context. Company name, maybe industry, products, location,
social network (which itself could be analyzable for further context – what they are talking
about, who they are associated with, etc.)
All this analysis lays down more data about the data, and allows you to model the
products in a whole new way inferring new characteristics from the organizations providing
the feedback. You can then provide analysis that is personalized around common attributes
between organizations and provide a much deeper drill-down based on a broader set of
harvested features.
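A minimal sketch of this three-step analysis over a single crowdsourced review event follows; the keyword lists, the intent weighting and the event fields are all illustrative assumptions.

```python
import re

POSITIVE = {"great", "reliable", "fast"}
NEGATIVE = {"slow", "buggy", "expensive"}

def opinion(review_text):
    """Step 1: crude keyword-based opinion score for a review."""
    words = set(re.findall(r"[a-z]+", review_text.lower()))
    return len(words & POSITIVE) - len(words & NEGATIVE)

def intent(actions):
    """Step 2: sharing or recommending reads as stronger buying intent."""
    return sum(2 if a in ("share", "recommend") else 1 for a in actions)

# Step 3: company context (e.g. from a business network) joined to the event
event = {"company": "Acme Corp", "industry": "retail",
         "review": "Great CRM, reliable but expensive",
         "actions": ["read", "share"]}
print(opinion(event["review"]), intent(event["actions"]), event["industry"])
```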
This type of analysis goes way beyond what a traditional analyst can possibly
achieve, although it clearly brings with it the risk of introducing noise. So, while the analyst is
not going anywhere anytime soon (and the great ones will always be in demand), these new
approaches of gathering and analyzing data will challenge the status quo. To what outcome,
only time will tell.
Some common crowdsourcing platforms are Quora, Frilp, Indiegogo and Kickstarter
(crowdfunding), etc.
5.3) Machine learning
Machine learning is a scientific discipline that explores the construction and study
of algorithms that can learn from data. Such algorithms operate by building a model from
example inputs and using it to make predictions or decisions, rather than following strictly
static program instructions. Machine learning is closely related to, and often overlaps
with, computational statistics, a discipline which also specializes in prediction-making.
Machine learning is a subfield of computer science stemming from research
into artificial intelligence. It has strong ties to statistics and mathematical optimization, which
deliver methods, theory and application domains to the field. Machine learning is employed in
a range of computing tasks where designing and programming explicit, rule-based algorithms
is infeasible. Example applications include spam filtering, optical character
recognition (OCR), search engines and computer vision. Machine learning is sometimes
conflated with data mining, although that focuses more on exploratory data analysis. Machine
learning and pattern recognition "can be viewed as two facets of the same field."
When employed in industrial contexts, machine learning methods may be referred
to as predictive analytics or predictive modelling.
Types of machine learning
Machine learning is usually divided into two main types.
In the predictive or supervised learning approach, the goal is to learn a mapping from
inputs x to outputs y, given a labeled set of input-output pairs D = {(xᵢ, yᵢ)}, i = 1, …, N.
Here D is called the training set, and N is the number of training examples. In the simplest
setting, each training input xᵢ is a D-dimensional vector of numbers, representing, say, the
height and weight of a person. These are called features, attributes or covariates. In general,
however, xᵢ could be a complex structured object, such as an image, a sentence, an email
message, a time series, a molecular shape, a graph, etc. Similarly, the form of the output or
response variable can in principle be anything, but most methods assume that yᵢ is a
categorical or nominal variable from some finite set, yᵢ ∈ {1, …, C} (such as male or female),
or that yᵢ is a real-valued scalar (such as income level). When yᵢ is categorical, the problem
is known as classification or pattern recognition, and when yᵢ is real-valued, the problem is
known as regression.
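A minimal supervised-learning sketch of the toy example in the text, using scikit-learn, where each xᵢ is a (height, weight) vector and yᵢ is a class label; the data points are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

X = [[152, 50], [160, 55], [178, 80], [183, 85]]   # height (cm), weight (kg)
y = ["female", "female", "male", "male"]           # categorical labels

model = LogisticRegression().fit(X, y)   # learn the mapping from x to y
print(model.predict([[170, 70]]))        # classify a previously unseen input
```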
The second main type of machine learning is the descriptive or unsupervised learning
approach. Here we are only given inputs, D = {xᵢ}, i = 1, …, N, and the goal is to find
“interesting patterns” in the data. This is sometimes called knowledge discovery. It is a much
less well-defined problem, since we are not told what kinds of patterns to look for, and there
is no obvious error metric to use (unlike supervised learning, where we can compare our
prediction of y for a given x to the observed value).
Fig 5. Comparison of the two types of machine learning.
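For contrast, a minimal unsupervised sketch: only the inputs D = {xᵢ} are given, and k-means looks for “interesting patterns” as clusters; the data and the choice of two clusters are illustrative.

```python
from sklearn.cluster import KMeans

D = [[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 9.1], [7.9, 8.8]]
kmeans = KMeans(n_clusters=2, n_init=10).fit(D)
print(kmeans.labels_)   # no labels were supplied; the groups are discovered
```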
6) CASE STUDIES
6.1) Case study on Indian retail giant Shoppers Stop
Three years ago, when Shoppers Stop Ltd started its Big Data analytics
programme, little did it know that Big Data would lead to big gains. In one of its earliest
analytics programmes, the company studied the buying patterns of members of its loyalty
programme, First Citizen. Based on the insights, it developed targeted promotions for trousers.
This led to around ₹10 crore worth of additional sales in a three-week period for Shoppers Stop.
After analysing its First Citizen base, the company had observed that not all those who buy
shirts also buy trousers. But those who buy both men’s shirts and trousers spend 60% more a
year on average than those who buy only shirts, and thrice as much as those who don’t buy
men’s shirts at all, said Vinay Bhatia, vice-president, marketing and loyalty, Shoppers Stop.
It then shortlisted over 900,000 people for a “targeted trouser communication”.
According to Bhatia, the 900,000 were further divided into three groups of target customers.
The first group included customers who showed a pattern of being interested in new brands in
other non trouser categories. They were sent information on new trouser brand launches and
fits. The second group included those who exhibited multiple buying patterns in other
categories. They were sent attractive deals if they bought two or more trousers.
Finally, the third was a “control group” to measure success or failure of the promotions.
“This (control group) is a practice that we follow for all our analytics insights,” added Bhatia.
The targeted communication exercise led to a lift of 30% in sales (about ₹10 crore) when
compared with the response received from the control group. Big Data analytics is now a
crucial part of the company’s strategy.
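The arithmetic behind a control-group comparison is simple; the sketch below computes lift as the relative improvement of a targeted group over the control, with invented figures rather than Shoppers Stop’s actual data.

```python
def lift(treatment_value, control_value):
    """Relative improvement of the targeted group over the control group."""
    return (treatment_value - control_value) / control_value

# e.g. sales per targeted member vs sales per control member (illustrative)
print(f"{lift(1300, 1000):.0%}")   # -> 30%, the kind of lift reported above
```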
6.2) Case study on Airbnb
Airbnb is an incredible success story. In just a few years, the company has become a
powerhouse in the travel industry, providing travelers with an alternative to hotels, and
providing individuals who have rooms, apartments or homes to rent with a new source of
income. In 2012, travelers booked over 5 million nights with Airbnb’s service. But it started
small, and its founders—adherents to the Lean Startup mindset—took a very methodical
approach to their success. Joe Zadeh, Product Lead at Airbnb, shared part of the company’s
amazing story. He focused on one aspect of their business: professional photography. It started
with a hypothesis: “Hosts with professional photography will get more business. And hosts
will sign up for professional photography as a service.” This is where the founders’ gut instincts
came in: they had a sense that professional photography would help their business. But rather
than implementing it outright, they built a Concierge Minimum Viable Product (MVP) to
quickly test their hypothesis. Initial tests of their MVP showed that professionally
photographed listings got two to three times more bookings than the market average. This
validated their first hypothesis. And it turned out that hosts were wildly enthusiastic to receive
an offer from Airbnb to take those photographs for them. In mid-to-late 2011, Airbnb had 20
photographers in the field taking pictures for hosts, roughly the same period in which we
see the proverbial “hockey stick” of growth in terms of nights booked.
Summary:
• Airbnb’s team had a hunch that better photos would increase rentals.
• They tested the idea with a Concierge MVP, putting the least effort possible into a test that
would give them valid results.
• When the experiment showed good results, they built the necessary components and rolled
it out to all customers.
Analytics Lessons Learned:
Sometimes, growth comes from an aspect of your business you don’t expect. When
you think you’ve found a worthwhile idea, decide how to test it quickly, with minimal
investment. Define what success looks like beforehand, and know what you’re going to do if
your hunch is right.
6.3) Case study on Indian Elections – the first Prime Minister to use Big Data
What makes Modi’s use of big data so impressive is that it was both relatively new to Indian
politics and fraught with unique challenges. Take, for example, the size of the Indian
electorate. With 814 million voters, in comparison to the USA’s 193.6 million and the UK’s
45.5 million, the sheer volume of data of India’s voting population was perhaps the largest
obstacle. The second was the variety of data – India’s voter rolls, in 12 different languages and
900,000 PDFs amounting to 25 million pages, made for a heterogeneous, non-uniform and
deeply diverse information set. Finally, the veracity of the information was often questionable
– one report noted that some voters were listed as 19,545 years old, and others a confounding
0 years old. Name overlapping (there are 327,000 women named “Sita” in Bihar alone) only
further complicated the process.
Despite these challenges, the rewards – as Modi has clearly demonstrated while
employing this data to “drive donations, enroll volunteers, and improve the effectiveness of
everything from door knocks…to social media” – are significant. BJP’s website, for
example, planted cookies on all computers that visited its site, and then used information
about these users’ further internet activity – i.e., the sites they visited after BJP’s – for
customised advertisements:
“If you move out of the BJP website and visit a website for bikes followed by a
search on jobs, the algorithm will make the inference that you are a young male from a
particular constituency, say Delhi, who is currently on a job hunt. What happens next is when
you visit a job searching portal like Naukri.com, this system pops up a contextual ad for you
like ‘jobs in Delhi’. The BJP banner which is just below the results will tell you ‘There are
no Jobs in Delhi. India deserves better’.”
Tactics like these, both online and offline analytics and marketing, were the
backbone of Modi’s success. He led the charge with both social media and the analysis of
publicly available data, whereas Indian politicians have traditionally relied on “hunches and
intuitions to gauge complex demographics of caste, religion, community and localities.”
7) 15 Indian Big Data companies to watch out for:
1. Heckyl: TechSparks 2011 winning company in the financial data analytics space.
Founded by Mukund Mudras, Som Sagar, Abhijit Vedak and Jaison Mathews.
2. Sigmoid Analytics: A TechSparks 2014 company, based out of Bangalore,
Sigmoid is in the area of real-time Big Data warehousing, streaming and ETL
(extract, transform and load) on Apache Spark. They have a technology
infrastructure which companies can use to store their data in a desired format,
perform operations on it and generate insights.
3. Flutura: Mines Big Data to perform analytics and gives hidden insights from
huge chunks of machine generated data for global oil and gas majors to bring in
efficiency and safety. Flutura was founded by Krishnan Raman, Derick Jose and
Srikanth Muralidhara.
4. Indix: Computes real-time data to give product insights for decision makers on
an intuitive dashboard. Founded by Sanjay Parthasarathy, the company has its
product engineering center based out of Chennai.
5. Fractal Analytics: Helps companies in predictive analytics and decision sciences
to understand, predict and shape consumer behavior through advanced analytics,
harmonize data, tell visual stories and forecast business performance.
6. Crayon Data: An algorithm called the WhiteBox, Simpler Choices, takes
massive data, cleans it up and presents only actionable insights to the banking,
hospitality and telecom sectors. It was founded by Srikant Sastri, Suresh
Shankar and Vijay Kumar.
7. Germin8: It is a leading Data Analytics company that helps brands with social
media measurement and monitoring solutions by analysing conversations in real
time. The Mumbai-based company was founded in 2007 by Raj Nair and his son
Ranjit Nair.
8. Aureus Analytics: With its platform called ASAP (Aureus Statistical and
Analytics Platform) it produces insights by mining enterprise data. Aureus was
founded by technology professionals Anurag Shah, Ashish Tanna and Nitin
Purohit.
9. Dataswft: A product of Bizosys Technologies Pvt Ltd, it has a customized search
engine that can decode technical information and return search results within
milliseconds. It was founded by Sunil Guttula, Abinasha Karana and Sridhar
Dhulipala.
10. C360: Corporate360 Pvt Ltd provides IT sales intelligence data services to
enterprises. The startup was founded by college dropout Varun Chandran. Prior
to founding Corporate360, Varun was working as sales and marketing executive
with the likes of SAP, Oracle, Dell and NetApp. C360 is based in India and
Singapore. Another similarly located Big Data company is Antuit holdings,
which raised $56 million from Goldman Sachs and Zodius Capital.
11. Metaome: A healthcare Big Data company focused on life sciences, founded
by Kalpana Krishnaswami and Ramkumar Nandkumar. Metaome’s product
DistilBio, offered as a free web-based graph search and as an enterprise
platform, accrues a variety of data from different sources (laboratory data
management systems, private and public databases) and structures it to
help identify patterns.
12. Frrole: A social intelligence startup with an offering focused on media and
brands, which allows its customers to integrate real-time Twitter data into their
digital properties and TV shows. The startup was founded by Amarpreet Kalkat,
Nishith Sharma and Abhishek Vaid.
13. Bridgei2i: It focuses on user-centric applications of Big Data. Founders are
Prithvijit Roy, Ashish Sharma, Pritam Kanti Paul.
14. Formcept: Focused on making data analysis accessible to everyone; founders are
Suresh Srinivasan and Anuj Kumar.
15. PromptCloud: A DaaS (Data-as-a-Service) platform that crawls the web for
data extraction. It was founded by Prashant Kumar.
8) Bibliography
1. "Data, data everywhere". The Economist. 25 February 2010. Retrieved 9 December 2012.
2. "Community cleverness required". Nature 455 (7209): 1. 4 September 2008.
3. "Sandia sees data management challenges spiral".
4. Reichman, O.J.; Jones, M.B.; Schildhauer, M.P. (2011). "Challenges and Opportunities of Open Data in Ecology".
5. Practical Guide to Controlled Experiments on the Web: Listen to Your Customers.
6. An introduction to A/B testing for marketing optimization.
7. Machine Learning: A Probabilistic Perspective by Kevin P. Murphy.
8. CDAS: A Crowdsourcing Data Analytics System – paper by NUS.
9. Case study about Shoppers Stop: http://www.livemint.com/Industry/J5NVBrcewAEM0qF02daqyL/Retail-sector-gains-big-from-Big-Data.html
10. Case study about the Indian elections: http://dataconomy.com/narendra-modi-first-prime-minister-use-big-data-analytics/
11. Case study on Airbnb: http://www.quibb.com/links/analytics-lessons-learned/view