10. The Origins of Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile or wearable devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
12. 1-Scale (Volume)
Data Volume
44x increase from 2009-2020
From 0.8 zettabytes (ZB) to 35 ZB
Data volume is increasing exponentially
Exponential increase in
collected/generated data
Characteristics of Big Data
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
10^12 10^15 10^18 10^21
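As a rough check of the volume figures on this slide, the sketch below recomputes the overall growth factor and the compound annual growth rate it implies. It assumes that 0.8 ZB refers to 2009 and 35 ZB to 2020 (11 years of growth); the function name is illustrative only.

```python
def growth_summary(start_zb: float, end_zb: float, years: int) -> tuple[float, float]:
    """Return (overall growth factor, implied compound annual growth rate)."""
    factor = end_zb / start_zb
    # Constant yearly rate r such that start * (1 + r) ** years == end
    annual_rate = factor ** (1 / years) - 1
    return factor, annual_rate


# Assumption: 0.8 ZB in 2009 and 35 ZB in 2020, as read from the slide
factor, annual_rate = growth_summary(0.8, 35.0, 2020 - 2009)
print(f"growth factor: {factor:.2f}x")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under these assumptions the factor comes out just under the quoted 44x, and the implied yearly growth is on the order of 40%, which is what "increasing exponentially" means in practice.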
13. 2-Complexity (Variety)
Various formats, types, structures (or
unstructured ones).
Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dimensional arrays, etc…
Static data vs. streaming data
A single application can be
generating/collecting many types of
data
To extract knowledge, all these types of
data need to be linked together
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
14. 3-Speed (Velocity)
Data is being generated fast and needs to be processed fast
Online Real-time Data Analytics
Late decisions mean missed opportunities
Examples
e-Promotions: based on your current location, your purchase history, and what
you like, send promotions right now for the store next to you
Healthcare monitoring: sensors monitor your activities and body;
any abnormal measurement requires immediate reaction
From: web.cs.wpi.edu/~cs525/s13-MYE/lectures/1/intro.pptx
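The healthcare monitoring example can be sketched as a check applied to each reading as it arrives, rather than to a batch collected afterwards. This is a minimal sketch only: the heart-rate bounds, the reading format, and the function name are assumptions made for illustration, not taken from the slides.

```python
from typing import Iterable, Iterator

# Assumed, illustrative bounds in beats per minute; real clinical
# thresholds would come from medical guidance, not from this sketch.
LOW_BPM = 40
HIGH_BPM = 140


def abnormal_readings(stream: Iterable[tuple[str, int]]) -> Iterator[tuple[str, int]]:
    """Yield each (timestamp, bpm) reading outside the bounds as soon as it is seen."""
    for timestamp, bpm in stream:
        if bpm < LOW_BPM or bpm > HIGH_BPM:
            yield timestamp, bpm


# Hypothetical sensor readings; in practice this would be a live feed
readings = [("09:00:00", 72), ("09:00:01", 75), ("09:00:02", 181), ("09:00:03", 70)]
alerts = list(abnormal_readings(readings))
for timestamp, bpm in alerts:
    print(f"ALERT {timestamp}: heart rate {bpm} bpm")
```

Because the function is a generator, an alert can be raised the moment an abnormal value appears, without waiting for the rest of the stream.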
20. Harnessing Big Data
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
21. What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time analysis
27. “Big data hubris,” or just nitpicking?
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014.
“The Parable Of Google Flu: Traps In Big Data Analysis.” Science 343 (14 March)
37. Fashion trends among consumers often
change in the blink of an eye
Philosophy of Zara
The apparel industry stresses the need to react rather
than predict.
Developed a business model where speed and decentralized
decision-making were essential.
Zara’s Fast Fashion
Understanding the items that its customers actually want.
38. Strategies of Zara
Vertical Integration
Small Batch Production
Collecting Vital Information for Decision Making
Best-selling items: type of fabric, cut, and colors
Quick response to Demand (Pull System/Message Sharing)
Analyze “Regional Pop”
Make the market segmentation closest to the customer needs.
High Product Turnover
Strong IT System
Real-time knowledge (dataflow) across the entire distribution-to-sale process
49. Zara Online Shop
Collect feedback for manufacturing
Identify the target market precisely
Hold consumer opinion surveys and
capture customer feedback to
improve the products that are
actually shipped
50. vs. Big Data
Information Integration, Focus on customer requirement, Decentralized decision-making
In-store Online Shop
Customer Behavior
PoS
Click Tracking
Online Forum
Consumer survey
DATA
Daily Report
High-velocity
shipping
Prototype Survey
Real-Time Data
Fashion Analysis
market segmentation
Quick Change Artist
Agile Management
51. Analytical Culture in Zara
Online Retail Website KPIs
Purchase conversion
Average Order Size
Items per Order
Purchase dropout rate
Effect on offline sales
Returned items rate
Company Marketing KPIs
Response rate by segment
Response rate by marketing media
Response rate by marketing message
Cost per marketing campaign/cost per sale
Revenue per marketing campaign/revenue per sale
Company Strategic KPIs
Ratio of winning designs
Ratio of cross-brand conversions (in INDITEX retail group)
61. Case: Big Data in Education
http://www-01.ibm.com/software/analytics/education/resources.html
MOOCs
Huge potential from Big Data perspective.
Learning portfolio for everyone?
Teaching tailored to each student's aptitude (self-directed and adaptive learning).
Human resources (HR development).
IBM
Collects academic, disciplinary and attendance data from school districts.
Analyzes over 150 key metrics, and presents information in reports and dashboards.
Develops early warnings to alert teachers and counselors to at-risk students before
they drop out. Up to 25% reduction in dropout rate.
Problem: In U.S. high schools, the dropout rate is over 30%.
In Mobile County, Alabama, it stood at 48%, translating into roughly 2,500
youths.
How to reduce the annual dropout rate?
68. Big Data Challenges
1. Meeting the need for speed
In today’s hypercompetitive business environment,
companies not only have to find and analyze the relevant
data they need, they must find it quickly. The challenge is
going through the sheer volume of data and accessing the
level of detail needed, all at high speed.
2. Understanding the data
It takes a lot of understanding to get data in the right shape.
69. Big Data Challenges (cont.)
3. Addressing data quality
The value of data for decision-making purposes will be
jeopardized if the data is not accurate or timely.
4. Displaying meaningful results
Representing analysis results becomes difficult when dealing with
extremely large amounts of information or a variety of
categories of information.
5. Dealing with outliers
Outliers may not be representative of the data, but they may also
reveal previously unseen and potentially valuable insights.
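One common way to surface outliers for review, rather than silently discarding them, is the interquartile-range (IQR) rule. The sketch below is one possible approach, not the method the slides or SAS prescribe; the sample data and the multiplier 1.5 are illustrative assumptions.

```python
import statistics


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily sales counts with one unusual day
daily_sales = [102, 98, 105, 97, 101, 99, 103, 100, 480]
flagged = iqr_outliers(daily_sales)
print(flagged)
```

The flagged values are returned instead of removed, so an analyst can decide whether each one is a data error or a genuinely interesting event.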
70. The Future of Big Data
Stop talking about how the quality of data matters less.
We are only starting to get to a point where we are truly
able to focus on the quality of big data.
Big data must be effectively stored, transferred,
transformed and analyzed without threatening the original
data.
Bigger, Better, Faster, Stronger
Ratio of winning designs – given the number of new designs the company churns out annually (11 thousand vs. two to four thousand for other major fashion brands), it is very important to cull the losers and focus on the winning designs. This can also help identify broader fashion trends in order to align the design team’s efforts with customers’ demands.
Ratio of cross-brand conversions – Zara is a leading brand in the INDITEX retail group. It is extremely important to monetize existing customers to the fullest extent possible by analyzing their purchases, behavior and attitudes collected through multiple channels. For example, referring Zara Home customers to Zara and vice versa could prove to be an efficient and effective way of monetizing existing customers thanks to brand recognition.
Purchase conversion – the number of purchases divided by the total number of visits. This metric is especially helpful when it is tied to changes made to the site itself and to the product inventory.
Average Order Size – the amount (in dollars, euros, or another currency) spent on each order. For most retailers, including Zara, it is an important metric, given that profit margins are often related to the monetary value of the purchase: the larger the order value, the lower the overhead.
Items per Order – shows the effectiveness of cross-selling on the site. It can also be connected to the effectiveness of a specific promotional campaign. This helps to optimize recommendations to various groups of customers and fine-tune the marketing message.
Purchase dropout rate – the ratio of customers who abandoned the purchase to the total number of customers who started the purchasing process. This metric is extremely helpful in identifying which steps in the sales process deter customers from completing their transactions. The cause of a dropout could lie in the website design or in the disclosure of additional information to the customer. For example, customers may abandon the purchase once the shipping cost is added to the total cost. This information may give further insight into customers’ price sensitivity or the timeliness of the order, and prompt Zara to optimize its shipping and handling operations.
Effect on offline sales – this metric is usually difficult to measure, but is extremely important to capture. Not all online visits end in a purchase, but some of those visitors could be gathering additional information before visiting the store to make a purchase. This metric could be partially measured by offering an additional service or gift at the store in exchange for a code provided online.
Returned items rate – the percentage of items purchased online that are returned. It is critical to receive and analyze customers’ feedback on the reasons for the return. For example, this metric may indicate deficiencies in how the product is presented online, whether in the picture of the item or in its verbal description. It could help improve the quality and clarity of the copy.
Response rate by segment – after customer segmentation is done, Zara can measure which segment responds better to a particular marketing campaign. Segmentation can be done by demographic, behavioral, attitudinal, geographic and other criteria.
Response rate by marketing media – measures whether customers respond better to regular mail, email, website promotions or in-store promotional offers.
Response rate by marketing message – this can be measured in any of the marketing channels: online, in-store, or a mailing campaign. However, online testing offers the most efficient way of testing the marketing message because the analysis of the data is available in near real time. After the message is tested online, it can be transferred to other marketing channels with a greater level of assurance about its effectiveness.
Cost per marketing campaign/cost per sale – measures effectiveness of a particular campaign in monetary terms.
Revenue per marketing campaign/revenue per sale – measures the overall revenue generated by a particular campaign or by an average sale in the campaign.
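The online retail KPIs described in these notes are simple ratios, and can be sketched as follows. The field names and sample counts are hypothetical, and the sketch assumes each started checkout leads to at most one order; it does not guard against zero denominators.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Hypothetical aggregate counts for one reporting period."""
    visits: int
    checkouts_started: int
    orders: int
    items_sold: int
    items_returned: int
    revenue: float


def online_kpis(s: SiteStats) -> dict[str, float]:
    """Compute the ratio-style KPIs from raw counts."""
    return {
        # purchases over total visits
        "purchase_conversion": s.orders / s.visits,
        # currency amount spent per order
        "average_order_size": s.revenue / s.orders,
        # cross-selling indicator
        "items_per_order": s.items_sold / s.orders,
        # abandoned purchases over purchases started
        "purchase_dropout_rate": (s.checkouts_started - s.orders) / s.checkouts_started,
        # returned items over items sold online
        "returned_items_rate": s.items_returned / s.items_sold,
    }


stats = SiteStats(visits=10_000, checkouts_started=500, orders=400,
                  items_sold=1_000, items_returned=50, revenue=30_000.0)
kpis = online_kpis(stats)
for name, value in kpis.items():
    print(f"{name}: {value:.3f}")
```

Tracking these values per period, alongside a log of site and inventory changes, is what makes it possible to relate a change in a ratio to a specific change on the site.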
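Technology Trigger
In this stage, amid heavy and excessive media coverage and irrational hype, the product's name is everywhere; however, as the technology's shortcomings, problems, and limitations emerge, failures outnumber successes. Example: the irrational surge of .com companies between 1998 and 2000.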
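Peak of Inflated Expectations
Excessive early public attention produces a series of success stories, along with, of course, many failures. Some companies take remedial action in response to failure, but most do nothing.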
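Trough of Disillusionment
Technologies that survive the earlier stages undergo solid, focused experimentation from many sides; their scope of application and limitations come to be understood objectively and practically, and business models that succeed and survive gradually grow.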
Slope of Enlightenment
In this stage, a new technology is born and receives close attention from major media and the industry. Example: the Internet and the Web in 1996.
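Plateau of Productivity
In this stage, the benefits and potential of the new technology are actually accepted by the market, and the tools and methodologies that support this business model in practice have evolved over several generations into a highly mature stage.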
From: SAS 2014, “Five big data challenges and how to overcome them with visual analytics,” http://www.sas.com/resources/asset/five-big-data-challenges-article.pdf
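In the past, the emphasis was that sufficient quantity would by itself produce quality; in the future, the focus needs to be more on value.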
http://www.cmswire.com/cms/big-data/bigger-better-faster-stronger-the-future-of-big-data-027026.php
Better Big Data
To make big data better, we need to stop talking about how the quality of data matters less in a big data world. If quantity and repetition determined the value of data, we would probably assume that every Twitter utterance by Kim Kardashian and Justin Bieber would be more meaningful than the combined works of Shakespeare. Although some big data scientist of the future may look back at the 21st century and determine that this is the case, this finding would only prove that we as a culture had never solved the true challenges of big data.
We are only starting to get to a point where we are truly able to focus on the quality of big data. Wikipedia has over 70,000 active contributors to clean up its big data and to keep the environment clean over time. As an open community, Wikipedia has become the standard of showing how the quality and improvement of big data can actually occur.
This evolution is still only at its starting point. Pure programmatic automation efforts to make data "better" currently lack the nuance and contextual knowledge to result in improved recommendations. In truth, the vast majority of enterprise data is typically siloed or otherwise inaccessible to the employees, partners and customers who would actually be able to correct the problem. And we are only starting to see the launch of self-service and automated data quality tools that will give line-of-business employees the power to fix their own data with startup software from Paxata, Trifacta, Tamr, and the efforts of larger vendors such as IBM's Watson Analytics and Informatica's Springbok. Until we put the power of data quality into the hands of the masses, big data will struggle to become better.
Stronger Big Data
Another key issue with big data — especially as it continues to outstrip the volumes of traditional data solutions — is the challenge of maintaining its purity and context. This challenge ranges from the high level challenges of business continuity and disaster recovery to the most granular challenges of data corruption. It's handled by a combination of data scanning, file detection, data replication, data integration and data recovery. But in between all of these areas are gaps that prevent big data from being as strong and resilient as it needs to be.
Regardless of volume, velocity and variety, big data must be effectively stored, transferred, transformed and analyzed without threatening the original data. This means that companies must figure out how to bring their storage, transfer, recovery and scrubbing activities together into an integrated big data resiliency department.
One of the biggest challenges to these efforts is to synchronize internal enterprise data scrubbing and replication efforts with similar efforts conducted by managed cloud service vendors. Although the techniques, technologies and efforts may be similar in nature, even simple operational challenges such as matching the frequency and performance metrics across a hybrid environment can be difficult to manage. But as we think about the strength of big data, it is increasingly important to bridge the gaps between firmware error detection, data integrity, data scrubbing, data replication and data management.
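One small building block of the integrity checks mentioned above is comparing a digest recorded at the source with a digest computed after transfer or replication. The following is a minimal sketch using SHA-256 from Python's standard library; the sample records and function names are hypothetical, and a real pipeline would hash files or blocks rather than in-memory bytes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_recorded_digest(received: bytes, recorded_digest: str) -> bool:
    """Check that a received copy still matches the digest recorded at the source."""
    return sha256_hex(received) == recorded_digest


# Hypothetical source data and the digest stored alongside it
original = b"order_id,amount\n1001,59.90\n1002,120.00\n"
recorded = sha256_hex(original)

good_copy = bytes(original)                              # replicated intact
corrupted_copy = original.replace(b"120.00", b"12O.00")  # one character changed in transit

print(matches_recorded_digest(good_copy, recorded))
print(matches_recorded_digest(corrupted_copy, recorded))
```

Running the same comparison on both the enterprise side and the cloud vendor side, against the same recorded digests, is one way to make the two scrubbing efforts comparable.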