TOUG Big Data Challenge and Impact

Presented to Toronto Oracle Users Group members on Jan 22, 2014 by Ian Abramson

  1. Getting Started with Big Data and Making an Impact
     @TOUG, Jan. 22, 2014
     Ian Abramson, EPAM Systems, Canada
     January 2014
     Confidential
  2. Big Data ... The Silver Bullet?
  3. Agenda
     • Introductions and Goals
     • What is Big Data
     • Technology Choices
     • Making an Impact with Data Science
     • Use Cases
  4. About Me
     • Degree in Applied Mathematics
     • Over 20 years with Oracle software
     • Over 10 years with data warehouses
     • Big Data Analyst
     • Author of numerous Oracle books
     • Blogger: http://ians-oracle.blogspot.com/
     • Oracle ACE
     • IOUG Past-President
     • TOUG Board Member
     • Toronto based
     • Twitter: @iabramson
  5. WHERE IS BIG DATA?
  6. Why Big Data?
     • New data sources
     • Unprecedented volume
     • Real World Issues
       – Data systems are reaching capacity, requiring high-cost alternatives
       – Archive data is too far offline
       – Organizations require cost-effective options
       – Retain all data for future analysis
  7. Big Data Defined
     • Roger Magoulas (O’Reilly Research): “Data becomes ‘Big Data’ when the size of the data becomes a part of the problem.”
     • Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
     • IDC: Big Data is a term/concept used as a generic name for a “generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis”.
     • Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
  8. The Attributes of Big Data
     • Classic Data Attributes:
       – Volume
       – Velocity
       – Variety
     • Big Data Technical Attributes
       – Massive, parallel computing environment
       – Infinitely scalable computing clusters, including cloud
     • Three main technical requirements
       – A medium that accommodates large volumes for storage and data streaming
       – The computing horsepower and architectural approach to process the data where it exists, not via extraction and processing
       – A programming model that supports a computational paradigm performing computations in a highly parallel and scalable environment
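The third requirement above, a programming model that computes in parallel close to the data, is what Map/Reduce offers. The following is a minimal single-process sketch in Python, assuming a word-count job. The function names and sample records are illustrative only and are not Hadoop's API; each inner list merely stands in for a data block that a real cluster would process on the node that stores it.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def run_job(partitions):
    """Run map over every record of every partition, then shuffle and reduce."""
    mapped = [
        pair
        for partition in partitions
        for record in partition
        for pair in map_phase(record)
    ]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: two partitions standing in for two storage blocks.
partitions = [["big data big impact"], ["data science", "big data"]]
counts = run_job(partitions)
print(counts)
```

In a real cluster the map calls run in parallel on separate nodes and only the grouped intermediate pairs move across the network; the sketch keeps everything in one process to show the shape of the computation.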
  9. Challenges for Big Data
     http://tdwi.org/blogs/fern-halper/2013/10/four-big-data-challenges.aspx
  10. Big Data and Data Warehouse – war or peaceful coexistence?
     • The problem: different uses need different schemas and different partitioning. In most cases the requirements are orthogonal, so it is impossible to provide data partitioning/indexing that is optimal for everybody
     • The ideal goal: acquire and store “as is”, then access using multiple models. This needs a powerful artificial-intelligence knowledge base and data access code generators
     • It will never be optimal for everybody without huge redundancy
     • The problems are less painful if most of the data is read anyway. Good for analytics, not good for OLTP
     • Eventually Big Data platforms will become DW platforms with well-developed access interfaces
     • Until then: acquire and store, then distribute on demand to conventional DW and data marts
  11. The New Data Architecture
     [Diagram labels: Data Archive; Operational Systems; Enterprise Data; Social & Clickstream; Sensor Generated; Big Data ODS; Hadoop; Public Data; HDFS; Map/Reduce; Historical Data; Data Warehouse/BI/Analytics; Other New Sources]
  12. TECHNOLOGY CHOICES
  13. The Choices for Your Data
     • RDBMS
       - High concurrency
       - TB storage
       - Indexed reads
       - Efficient updates
       - Caching
       - Highly secure
     • Analytic Appliances
       - Scalable
       - Medium concurrency
       - High-volume processing (Postgres)
       - No indexes
       - TB+: Netezza (128TB/rack), Oracle (300TB/rack)
     • NoSQL
       - Highly scalable
       - High concurrency
       - Storage options
       - Updates
       - Real-time capable
       - Rudimentary indexes
       - TB+ capacity
     • Hadoop
       - Highly scalable
       - Low concurrency
       - Distributed storage
       - Complex access
       - Security (TBD)
  14. The Open Source/Big Data Landscape
     http://www.bigdata-startups.com/open-source-tools/
  15. Hadoop In Detail
     Reference: http://blog.blazeclan.com/252/
  16. Hadoop Distributions
  17. For Example, if you choose Cloudera…
  18. Comparing Hadoop Distributions
     http://www.infoworld.com/d/business-intelligence/enterprise-hadoop-big-data-processing-made-easier-184330?page=0,5
  19. Big Data’s Technical Challenges
     • Disaster recovery
     • Security
     • Data consistency
     • Workload management
     • Reprocessing
     • Troubleshooting
     • Performance
  20. DATA SCIENCE
  21. Big Data vs. BI presentation viewpoint
     IMPACT
  22. Questions for BI and Big Data
     • Sample questions for BI
       – What is my sales volume by time, by region, by store, by season?
       – What is the average review rating by product category, by product? What are the dynamics of reviews, and what are the trends?
     • Sample questions for Big Data/Data Science
       – How do changes in review ratings impact sales?
       – What is the time lag between a change in review rating and a change in sales volume?
       – What products are purchased together, and can I improve product recommendations?
  23. DATA SCIENCE
     [Diagram labels: Data Science; Skills; Science]
     • Purpose: State the problem
     • Research: Discover information about the topic
     • Hypothesis: Predict the outcome
     • Experiment: Develop a process to test the hypothesis
     • Analysis: Record the results
     • Conclusion: Compare hypothesis and results
  24. Data Science Team
     Each team would include:
     • Data Science Analyst – excellent communication skills, science and analytical background
     • Data Science Researcher/Solution Architect – good communication, good statistics/math, working knowledge of 2 of the following data science libraries (Mahout or any other machine learning library, RHadoop, R, SAS, SPSS)
     • Data Science Technologist – acceptable communication skills, 25% deployable to the client site (at minimum a few should be deployable, others can be offshore), good developer, working knowledge of Big Data and related technologies
     • Data Science Presentation Engineer – knowledge of BI and presentation tools
     Nordstrom’s Big Data Team Mission: “Delighting Customers through data-driven products”
  25. USING BIG DATA
  26. Data Science Sample Use Cases
  27. Top 10 Use Cases (2013 Computerworld)
     1. Modeling Risk
     2. Customer Churn Analysis
     3. Recommendation Engines
     4. Ad Targeting
     5. POS Transaction Analysis
     6. Analysis of network data to predict future failures
     7. Threat Analysis
     8. Trade Surveillance
     9. Search Quality
     10. Data Sandbox
     http://www.computerworld.com.sg/resource/storage/iiis-2013-technical-workshops/?page=2
  28. The Big Data of Dating
     • From analysis of match.com dating patterns:
     • 21+ million members
     • 100+ million hits per month
       – January 2nd is the busiest day for people to sign up on dating sites
       – Women get 60% more attention if their photo is taken indoors
       – Men get 19% more attention if theirs is taken outside
       – Full-body photos boost success for both sexes by 203%
       – Posing with animals or your best friends might seem cute, but it actually reduces your popularity by 53 per cent (men) and 42 per cent (women)
       – Men get 8% fewer messages if they put up selfies
       – Mentions of words like divorce and separated get men 52 per cent more messages
       – Women who are more forward, using phrases like dinner, drinks or lunch in the first message, get 73 per cent more replies, while men should play it cooler: those who mention the same words in their opening message get 35 per cent fewer replies
  29. Use Case Development
     [Flow diagram steps: Business Stakeholders; Business Questions; Identify Business Value; Define Success Criteria; Develop Hypothesis and Identify Data Sources; Iterate results and develop data for goals]
  30. Use Case Checklist
     • Title – an active description which identifies the goals of the primary actor
     • Characteristics:
       – Primary actor
       – Goal in Context
       – Scope
       – Level
       – Stakeholders and Interests
       – Precondition
     • Success criteria
       – Precondition
       – Minimal Guarantees
       – Success Guarantees
       – Trigger
       – Main Success Scenario
       – Extensions
     • Technology & Data Variations List
     • Related Information
     Reference: Alistair Cockburn
  31. EXPEDIA CASE STUDY
     Archive Use Case
     • 1.5 petabytes of continuously ingested data
     • One of the largest Hadoop clusters in the world
     • 80% Open Source
     • EDW staging and historical analysis
     • Call center and online data
     Customer Benefits
     • Avoided massive cost of new DW infrastructure
     • Able to keep and analyze historical transactions
     • Reduced risk of DW replacement
     • Able to scale on demand using low-cost servers
     Transaction Volume
     • > 500 GB daily increases from all sources: transaction, social, contact center
     [Diagram labels: Informatica transformation & aggregation; Analytic Infrastructure]
  32. Use Case: Sales Analysis
     Sales per sq.ft.: Changes Over Time
     • Fitting a no-intercept line to the scatter of sales against sales floor area gives a visual baseline Sales-per-Sq.Ft. (SpSF) for each year
     • Mathematically, the SpSF measure is given by the slope coefficient of the trend: 392.51 [CAD/Sq.Ft.] in 2011 vs. 373.76 [CAD/Sq.Ft.] in 2012
     [Chart labels: 417 in 2011; 417 in 2012; SpSF]
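The slope of a least-squares line forced through the origin has a closed form, slope = Σ(x·y) / Σ(x²), where x is floor area and y is sales. A minimal Python sketch of that calculation follows. The store figures are made up for illustration (they are not the data behind the 392.51 and 373.76 values on the slide), and the function name is hypothetical.

```python
def sales_per_sqft(floor_areas, sales):
    """Slope of the least-squares line through the origin: sum(x*y) / sum(x*x)."""
    if len(floor_areas) != len(sales):
        raise ValueError("floor_areas and sales must have the same length")
    sum_xx = sum(x * x for x in floor_areas)
    if sum_xx == 0:
        raise ValueError("at least one store must have a non-zero floor area")
    sum_xy = sum(x * y for x, y in zip(floor_areas, sales))
    return sum_xy / sum_xx


# Hypothetical stores: floor area in sq.ft. and annual sales in CAD.
areas = [1000.0, 2000.0, 3000.0]
sales_2011 = [400_000.0, 800_000.0, 1_200_000.0]
sales_2012 = [380_000.0, 760_000.0, 1_140_000.0]

spsf_2011 = sales_per_sqft(areas, sales_2011)
spsf_2012 = sales_per_sqft(areas, sales_2012)
print(spsf_2011, spsf_2012)
```

Dropping the intercept encodes the assumption that a store with zero floor area sells nothing, which is what lets the slope be read directly as sales per square foot.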
  33. Looking for Patterns and Anomalies
     This chart tells us that most of the stores have their highest sales on Saturday. But Store X peaks on Friday and is also doing well on Mondays. Why?
     [Chart: sales by day of week, THU through WED; y-axis from 0 to 10,000,000]
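One simple way to surface a store like Store X is to find each store's peak day and flag the stores whose peak differs from the most common one. The Python sketch below assumes sales are already aggregated by store and weekday; the store names and figures are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def peak_day(daily_sales):
    """Return the weekday with the highest sales for one store."""
    return max(daily_sales, key=daily_sales.get)


def stores_off_pattern(sales_by_store):
    """Return the most common peak day and the stores that peak on another day."""
    peaks = {store: peak_day(days) for store, days in sales_by_store.items()}
    typical, _ = Counter(peaks.values()).most_common(1)[0]
    outliers = {store: day for store, day in peaks.items() if day != typical}
    return typical, outliers


# Hypothetical weekly sales per store (CAD).
sales_by_store = {
    "Store A": {"THU": 50, "FRI": 70, "SAT": 95, "SUN": 60, "MON": 40, "TUE": 35, "WED": 38},
    "Store B": {"THU": 45, "FRI": 65, "SAT": 90, "SUN": 55, "MON": 42, "TUE": 30, "WED": 33},
    "Store X": {"THU": 48, "FRI": 92, "SAT": 80, "SUN": 50, "MON": 75, "TUE": 31, "WED": 36},
}

typical, outliers = stores_off_pattern(sales_by_store)
print(typical, outliers)
```

This only flags the anomaly; answering the slide's "Why?" still requires looking at what is different about the flagged store.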
  34. Affinity Analysis Use Case
     Build a model that provides the foundation for analyzing and understanding the factors that influence year-over-year change in store performance
     • Affinity Analysis is an input to:
       – Identify products purchased in tandem
       – Provide guidance and recommendations for upsell and cross-sell
       – Redesign stores, layouts and planograms
       – Discount plans and promotions
       – Identifying customer baskets in different times and geographies
     • Investigating patterns on fine line and product levels
     • Ranking customer baskets by:
       – Number of times bought together
       – Revenue contributed
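Ranking by "number of times bought together" can be sketched as counting how often each pair of products shares a basket. The Python example below is a minimal illustration under that assumption, not the model used in the talk; the baskets are invented, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def rank_pairs(baskets):
    """Count how often each unordered product pair appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical key per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts.most_common()


# Hypothetical baskets.
baskets = [
    ["snow scraper", "washer fluid"],
    ["snow scraper", "washer fluid", "gloves"],
    ["gloves", "washer fluid"],
]

ranked = rank_pairs(baskets)
for pair, times in ranked:
    print(pair, times)
```

Ranking by revenue contributed would follow the same pattern, summing basket revenue per pair instead of adding one.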
  35. Clustering of Products
  36. Snow Scrapers and Washer Fluid
  37. Related Baskets
     Size of the circle shows how often the basket has been purchased
     Season: 2012-05-16 - 2012-08-28
     This kind of analysis can be used for spotting driver products
     1. Potted annuals/plants, Cell-packs/annual plants
     2. Potted annuals/plants, vegetables/plants
     3. Potted annuals/plants, Outdoor soils/outdoor lawn & plant care
     4. Cell-packs/annual plants, vegetables/annual plants
  38. Big Data is Evolving
     • The industry is evolving
       – Hadoop is now 8 years old, since its start in 2006 at Yahoo
       – CDH 5 recently released
       – $2.5B in venture capital in the space
       – Hadoop is now considered a standard
       – HBase is an example of a project which has not found a standard
       – Many tools today. What will there be 5 years from now?
     • How to avoid the big data pitfalls?
       – 50% of big data projects fail
       – Those who succeed drive it by focus
       – Insight vs. Impact
       – Find one problem and fix it
     • Data Science
       – Change how you do analysis… scientific methods
       – New and exciting
       – Build a hybrid team to develop data solutions
       – The team can program, knows math and statistics, and can communicate
  39. The Big Data Adventure
  40. Thank You and Questions
     Ian Abramson
     EPAM Systems
     Toronto, Canada (GMT -5)
     Mobile phone: +1 (416) 254-9286
     Skype: ian.abramson
     E-mail: Ian_Abramson@epam.com
