4. “We have internal information. Getting information from outside is our challenge. There’s no way of doing that.”
– Senior Editor, Leading Media Company
8. UNCOVER YOUR DARK DATA
Source: http://www.patrickcheesman.com/dark-data-problems-and-solutions/
• INACCESSIBLE data (e.g. technology is outdated)
• FORGOTTEN data (e.g. collected, but not actively used)
• UNCOLLECTED data (e.g. information exists, not digitized)
• SINGLE PURPOSE data (e.g. used for a specific purpose)
9. We’ve used network diagrams to detect terrorism, corporate fraud, product
affinities and behavioural customer segmentation
10. AUGMENT YOUR DATA SOURCES
DATA IS EVERYWHERE
COMMON COMPLAINT #1: WE DON’T HAVE DATA
COMMON COMPLAINT #2: THE DATA ISN’T STRUCTURED
CRM DATA
SALES DATA
PRICING DATA
CALL RECORDS
WEB LOG DATA
VENDOR INVOICES
SOCIAL MEDIA DATA
CLICKTHROUGH DATA
COMPETITOR RESEARCH
CUSTOMER TRANSACTIONS
…
CENSUS DATA
E-COMMERCE PRICES
COMMODITY PRICES
STOCK MARKET DATA
FINANCIAL REPORTING
SOCIAL MEDIA DATA
MOBILE PENETRATION
AADHAAR DATA
COURT CASE BRIEFS
SHAPE FILES
…
11. Visualising the Mahabharata
How does the Mahabharata, one of the largest epics with 1.8 million words, lend itself to text analytics?
Can this ‘unstructured data’ be processed to extract analytical insights?
What does sentiment analysis of this tome convey?
Is there a better way to explore relations between characters?
How can closeness of characters be analysed & visualized?
12. “Can we help CFOs understand what questions are being asked by investors and analysts during earnings releases? How is this different from the competition?”
– Product Head, Global Financial Services Firm
14. DATA IS EVERYWHERE
EXTRACT THE META DATA
AUGMENT YOUR DATA SOURCES
COMMON COMPLAINT #2: THE DATA ISN’T STRUCTURED
COMMON COMPLAINT #3: THE DATA ISN’T RICH / CLEAN
COMMON: WHO, WHAT, WHEN, WHERE
TEXT: TEXT KEYWORDS, SENTIMENT
IMAGE: VISUAL RECOGNITION
AUDIO / CALLS: TRANSCRIPTS, MOOD ANALYSIS
15. “Can we get the results of every single election in history, and create a portal to visualize these results?”
– Rajdeep Sardesai, CNN-IBN
19. … with several names spelt wrong
These are, in fact, two different constituencies
But these are exactly the same
… and so are these
I’ve no idea if these are 2, or 3, constituencies!
20. … with the ability for the system to correct errors automatically
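The slides do not say how the system corrects misspelt constituency names. One possible approach, sketched here purely as an illustration, is fuzzy matching against a canonical list using Python's standard `difflib` module. The canonical names, the `normalise` and `correct` helpers, and the 0.85 cutoff are all assumptions, not details from the original portal.

```python
import difflib

# Hypothetical canonical list; the portal's real reference data is not shown in the slides.
CANONICAL = ["Bangalore North", "Bangalore South", "Chennai Central", "Mumbai North"]


def normalise(name):
    """Lower-case the name and collapse punctuation and extra whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def correct(name, canonical=CANONICAL, cutoff=0.85):
    """Return the closest canonical name, or None if nothing is similar enough.

    The cutoff is an assumed value: a high cutoff leaves doubtful cases
    (such as two genuinely different constituencies with similar names)
    unmatched so that a person can review them.
    """
    lookup = {normalise(c): c for c in canonical}
    matches = difflib.get_close_matches(
        normalise(name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


if __name__ == "__main__":
    for raw in ["Bangalor  North", "CHENNAI-CENTRAL", "Hyderabad"]:
        print(repr(raw), "->", correct(raw))
```

Names that come back as `None` are the ones that would still need manual checking, which matches the slide's warning that some look-alike names are in fact different constituencies.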
21.
22. DATA IS EVERYWHERE
TRANSFORM THE DATA & ENRICH IT
EXTRACT THE META DATA
AUGMENT YOUR DATA SOURCES
COMMON COMPLAINT #3: THE DATA ISN’T RICH / CLEAN
24. LET’S LOOK AT 15 YEARS OF US BIRTH DATA
This is a dataset (1975 – 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.
For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the C-section option to move dates?
• Is there any day of the month that has unusually high or low births?
• Are there any months with relatively high or low births?
More births Fewer births … on average, for each day of the year (from 1975 to 1990)
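The "more births / fewer births … on average, for each day of the year" view can be approximated with a small aggregation. The sketch below is an assumption about how such an index might be computed, not the method used for the slide: it averages births for each calendar day across years and divides by the mean of those daily averages. The function name and the input layout are hypothetical.

```python
from collections import defaultdict
from datetime import date


def daily_birth_index(records):
    """records: iterable of (date, births) pairs.

    Returns {(month, day): ratio}, where ratio is the average number of
    births on that calendar day divided by the mean of all calendar-day
    averages. Values above 1 mean more births than usual, below 1 fewer.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for d, births in records:
        key = (d.month, d.day)
        totals[key] += births
        counts[key] += 1
    averages = {k: totals[k] / counts[k] for k in totals}
    overall = sum(averages.values()) / len(averages)
    return {k: avg / overall for k, avg in averages.items()}


if __name__ == "__main__":
    # Made-up numbers, only to show the shape of the input.
    sample = [
        (date(1975, 1, 1), 80), (date(1976, 1, 1), 100),
        (date(1975, 1, 2), 110), (date(1976, 1, 2), 110),
    ]
    print(daily_birth_index(sample))
```

Colouring each day by this ratio gives the kind of calendar heat map the slide describes.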
25. THE PATTERN IN INDIA IS QUITE DIFFERENT
This is a birth date dataset that’s
obtained from school admission data
for over 10 million children. When we
compare this with births in the US, we
see none of the same patterns.
For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
More births Fewer births … on average, for each day of the year (from 2007 to 2013)
26. THIS ADVERSELY IMPACTS CHILDREN’S MARKS
It’s a well-established fact that older children tend to do better at school in most activities. Since many children have had their birth dates brought forward, these younger children suffer.
Children “born” on the 1st, 5th, 10th, 15th, etc. of the month tend to score lower marks on average.
Higher marks Lower marks … on average, for children born on a given day of the year (from 2007 to 2013)
28. RESTAURANT FOUND AN UNUSUAL DIP IN SALES
A restaurant chain had data for every
single transaction made over a few
years. Plotting this as a time series
showed them nothing unusual.
However, the same data on a calendar
map reveals a very different story.
Specifically, at the bottom-left point-of-sale terminal, sales dip every Wednesday. At the bottom-right point-of-sale terminal, sales rise every Wednesday (almost as if to compensate for the loss).
It turns out that the manager closes the bottom-left counter every Wednesday afternoon due to a shortage of staff, assuming that it results in no loss of sales. There is, however, a net loss every Wednesday.
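A weekday pattern like this is hidden in a plain time series but shows up once sales are grouped by terminal and day of the week. The following is a minimal sketch of that grouping, assuming transactions are already summarised as (terminal, date, amount) rows. The function name, terminal label, and figures are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_profile(sales):
    """sales: iterable of (terminal, date, amount) rows.

    Returns {terminal: {weekday: ratio}}, where ratio is the average
    sales on that weekday divided by the terminal's overall average.
    A ratio well below 1 flags a recurring dip.
    """
    by_day = defaultdict(lambda: defaultdict(list))
    for terminal, d, amount in sales:
        by_day[terminal][WEEKDAYS[d.weekday()]].append(amount)

    profile = {}
    for terminal, days in by_day.items():
        amounts = [a for values in days.values() for a in values]
        overall = sum(amounts) / len(amounts)
        profile[terminal] = {
            day: (sum(values) / len(values)) / overall
            for day, values in days.items()
        }
    return profile


if __name__ == "__main__":
    # Made-up figures: 1 January 2024 was a Monday.
    sample = [
        ("bottom-left", date(2024, 1, 1), 100),
        ("bottom-left", date(2024, 1, 2), 100),
        ("bottom-left", date(2024, 1, 3), 40),
        ("bottom-left", date(2024, 1, 10), 40),
    ]
    print(weekday_profile(sample))
```

Summing the same ratios across terminals for each weekday would show whether a rise at one counter fully offsets a dip at another.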
30. Nation-wide statistics on
behaviour and performance of students
Over 1,000 questions each administered to
several lakhs of students across the country
31. Having books improves reading ability
Having more books at home improves the performance of children when it comes to reading. (But children typically have only 1-10 books at home.)
… but the impact in social is less
While having more books improves the reading % score by 8%, it only increases the social % by 4%.
33. Watching TV occasionally is good
Children who watch TV
every day don’t do as well
as children who watch TV
only once a week.
But children who never
watch TV fare the worst.
Watching TV every day
helps improve children’s
reading ability a little bit
more…
… but mathematical
abilities fall dramatically at
that point
34. Having educated parents helps most
This table shows the % improvement in score due to each factor
THIS TECHNIQUE CAN BE
APPLIED TO ANY DATASET
35. AUTOMATING ANALYSIS IN POULTRY FARMING
We group by every
input factor
… and calculate the
impact on every metric.
By moving from average to the best
group, what’s the improvement?
The actual performance
by each group is shown
0-3m | 3-6m | 6m-1yr | 1-2 yrs | > 2 yrs
11   | 12.3 | 12.7   | 15.3    | 16.1
Our product can create visualisations from data automatically, without any supervision.
Above is an example. Irrespective of the dataset, this visual shows which input parameters
have a significant impact on the output. Another such example is the cluster scatterplot.
Only significant results shown
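The "group by every input factor, calculate the impact on every metric" idea can be sketched in a few lines. This is an illustration under stated assumptions, not the product's implementation: the `factor_impact` function and the field names are hypothetical, higher metric values are assumed to be better, and the significance filter mentioned on the slide is left out.

```python
from collections import defaultdict


def factor_impact(rows, factors, metric):
    """For each factor, group rows by its value and average the metric.

    Reports the % improvement from the overall average to the
    best-performing group (assumes a higher metric is better).
    """
    overall = sum(r[metric] for r in rows) / len(rows)
    result = {}
    for factor in factors:
        groups = defaultdict(list)
        for r in rows:
            groups[r[factor]].append(r[metric])
        means = {value: sum(v) / len(v) for value, v in groups.items()}
        best = max(means, key=means.get)
        result[factor] = {
            "group_means": means,
            "best_group": best,
            "improvement_pct": 100 * (means[best] - overall) / overall,
        }
    return result


if __name__ == "__main__":
    # Made-up rows with hypothetical field names.
    rows = [
        {"age": "0-3m", "feed": "A", "weight": 10},
        {"age": "0-3m", "feed": "B", "weight": 12},
        {"age": ">2yrs", "feed": "A", "weight": 14},
        {"age": ">2yrs", "feed": "B", "weight": 20},
    ]
    for factor, info in factor_impact(rows, ["age", "feed"], "weight").items():
        print(factor, info["best_group"], round(info["improvement_pct"], 1))
```

Running this for every metric and sorting by `improvement_pct` gives a ranked list of factors. A real version would also test whether the differences between groups are statistically significant before showing them.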
36. 68% correlation
between AUD & EUR
Plot of 6 month daily
AUD - EUR values
Block of correlated
currencies
… clustered
hierarchically
41. S ANAND, CHIEF DATA SCIENTIST, GRAMENER
THE CAPABILITIES ARE
IN YOUR REACH TODAY
EXPLORE THE ART OF DATA
Editor's Notes
https://flic.kr/p/aCqg7w
For the same chain, we also looked at the daily sales across restaurants. Here is a series of calendar maps showing the daily sales for four different point-of-sale terminals at one restaurant. Each calendar map shows a calendar for 7 months. Each day is coloured based on the value of sales on that day. Red indicates low sales, green indicates high sales.
For the two terminals at the front (i.e. the ones you see on top), sales were relatively low during the first two months, but picked up steadily thereafter. It’s easy to spot the exceptions. For example, the 30th and 31st of January were good days for both terminals.
Interestingly, when you look at the terminal at the bottom left, there is a red bar indicating a consistent dip in sales every Wednesday. Almost as if to compensate, the terminal at the bottom right has an increase in sales every Wednesday – but not as significant as the dip.
We did not have an explanation for this, though our client did a few weeks later. It turned out that the person manning the bottom-left counter takes a half-day off every Wednesday, and was not being replaced by the manager. The queue naturally shifts over to the other terminal, increasing its sales. But this restaurant is in an area where there are many other food outlets. Once the queue reaches a certain size, people drop off, resulting in a net loss in sales every Wednesday – a loss that had gone unobserved for at least 7 months.
So, what we did was put a variant of this visual together. On the right, you have a series of currencies like the Australian dollar, the Euro, the British pound, etc; some commodities like silver and gold; and some stock indices like Sensex, FTSE, and S&P.
The cells here have a number inside that indicates the pairwise correlation between a pair of securities. For example, the number 68 on the top left indicates a 68% correlation between the Australian dollar and the Euro. To the left of the Euro and just below the dollar (diagonally opposite to the 68), there’s a scatter plot that shows the daily prices of both these currencies. Each dot is one day’s data. The x-axis shows the Australian dollar value. The y-axis shows the Euro value. This helps identify what the pattern of movements of any two currencies is. From this, you can easily see visually that the Australian dollar and the Euro both tend to move together. Or, where there are strong correlations like the FTSE & S&P, the pattern is almost a straight line.
In some cases there are negative correlations. For instance, if you take the Sensex against the Japanese Yen, the correlation is -79%. The cells are coloured based on their correlation values. Greens indicate strong positive correlation. Reds indicate strong negative correlation.
These are also grouped hierarchically. On the left, we have a series of lines indicating clusters. The most similar securities are grouped together. So FTSE and S&P with a 98% correlation are very close. The ones that are less correlated are kept further away based on a tree-structure.
This leads to clustering of securities. For example, there is a green block in the center which has SGD, JPY, XAU, CHF and CNY. All of these are fairly well correlated. When any one currency in this block goes up, all the others go up as well. When any one goes down, all others go down as well.
Similarly, you have another block to its top left: S&P, FTSE, Sensex and to a certain extent, the Pakistani Rupee. These move together as a block as well.
But when this block goes up, all the currencies in the other block go down, as indicated by the red negative correlations between these two blocks.
This can be used very easily for decision making. For example, one client who was trading with Singapore and Japan looked at the strong correlation and decided to consolidate their holdings in Japanese Yen. They then moved up and down this column to find a good hedge. FTSE looked like a good hedge – it was the most negatively correlated with JPY at that time – and they decided to place a third of their portfolio in FTSE.
A sheet like this improves people’s understanding of relatively complex data, and results in significantly increased trade volumes.
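The three ingredients described in these notes (pairwise correlations, hierarchical grouping, and picking the most negatively correlated security as a hedge) can be sketched with the standard library alone. This is an illustrative sketch, not the tool behind the slide: the notes do not name the clustering method, so greedy average-linkage merging is an assumption, and the function names and sample series are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (neither constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(series):
    """series: {name: [daily values]} -> {name: {name: correlation}}."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}


def cluster_merges(corr):
    """Repeatedly merge the two clusters with the highest average correlation.

    Returns the merges in order as (cluster_a, cluster_b, avg_correlation);
    early merges are the most similar securities.
    """
    clusters = [(name,) for name in corr]
    merges = []

    def avg(c1, c2):
        return sum(corr[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (avg(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        score, i, j = max(pairs)
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


def best_hedge(corr, name):
    """Return the security most negatively correlated with `name`."""
    others = [o for o in corr if o != name]
    return min(others, key=lambda o: corr[name][o])


if __name__ == "__main__":
    # Made-up series, not real market data.
    series = {
        "A": [1, 2, 3, 4, 5],
        "B": [2, 4, 6, 8, 10],
        "C": [5, 4, 3, 2, 1],
        "D": [10, 8, 6, 4, 1],
    }
    corr = correlation_matrix(series)
    for left, right, score in cluster_merges(corr):
        print(left, "+", right, round(score, 2))
    print("hedge for A:", best_hedge(corr, "A"))
```

The order of merges plays the role of the tree on the left of the visual, and a final merge with a negative average correlation corresponds to the two blocks that move in opposite directions.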
We were working with a restaurant that had 7 months’ worth of sales data, and asked what we could do with this data. It was a fairly open-ended problem.
Among other things, we looked at the various product categories they sold, such as starters, breads, desserts, etc. and the pairwise correlations between each of these.
The number in each cell shows the pairwise correlation between any two products. The 17 on the top left, for example, indicates a 17% correlation between side dishes and meals. The scatter plots diagonally opposite show the correlations between these visually as well. These are colour coded based on the correlation. The redder it is, the more negative the correlation. The greener it is, the more positive the correlation.
There are a few patterns that emerge. For example: desserts are positively correlated with every product. The row and column are green right through, indicating that it doesn’t matter what people eat – they usually have desserts at the end.
Starters are an interesting category. They were introduced 4 years ago as a loss-leader, with the aim of increasing the restaurant’s menu variety and bringing in footfall. As a result, they were priced at cost. You can see from this that starters sell well with breads (rotis, naans, etc). They sell well with desserts, but then, everything sells well with desserts. But they reduce the sales of every other product!
What’s been happening is that since starters were so attractive, people were coming in, ordering starters and desserts, and leaving. As a result, this initiative had been a net loss for the profit margin, though it had not been spotted for nearly four years.
When you look at the correlations at an individual item level, it turns out that there’s one product that is negatively correlated with almost every other product: the 1 litre mineral water bottle.
This is a curious phenomenon, and our client explained this once they realised what was happening. Theirs is a low-end chain of restaurants and it’s mostly individuals (not families) that visit this restaurant. Their customers are rather price-conscious. When they buy 1 litre of water, they want to make sure that they do not waste it. And when an entire litre is consumed, there’s not much space in the stomach for other things.
An obvious solution was to replace the 1 litre packaging with a smaller 200ml bottle. This ends up turning the entire row and column of reds into neutral yellows, resulting in an overall increase in sale of all products.