TOUG Big Data Challenge and Impact

Presented to Toronto Oracle Users Group members on Jan 22, 2014 by Ian Abramson

Published in: Technology, Education

Transcript

  1. Getting Started with Big Data and Making an Impact
       • @TOUG, Jan. 22, 2014
       • Ian Abramson, EPAM Systems, Canada
       • January 2014
       • Confidential
  2. Big Data... The Silver Bullet?
  3. Agenda
       • Introductions and Goals
       • What is Big Data
       • Technology Choices
       • Making an Impact with Data Science
       • Use Cases
  4. About Me
       • Degree in Applied Mathematics
       • Over 20 years with Oracle software
       • Over 10 years with data warehouses
       • Big Data Analyst
       • Author of numerous Oracle books
       • Blogger: http://ians-oracle.blogspot.com/
       • Oracle ACE
       • IOUG Past-President
       • TOUG Board Member
       • Toronto based
       • Twitter: @iabramson
  5. WHERE IS BIG DATA?
  6. Why Big Data?
       • New data sources
       • Unprecedented volume
       • Real World Issues
           – Data systems are reaching capacity, requiring high-cost alternatives
           – Archive data is too far offline
           – Organizations require cost-effective options
           – Retain all data for future analysis
  7. Big Data Defined
       • Roger Magoulas (O’Reilly Research): “Data becomes ‘Big Data’ when the size of the data becomes a part of the problem.”
       • Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
       • IDC: Big Data is a term/concept which is used as a generic name for a “generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis”.
       • Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
  8. The Attributes of Big Data
       • Classic Data Attributes:
           – Volume
           – Velocity
           – Variety
       • Big Data Technical Attributes
           – Massive, parallel computing environment
           – Infinitely scalable computing clusters, including cloud
       • Three main technical requirements
           – A storage medium that accommodates large volumes, both stored and streaming
           – Computing horsepower and an architectural approach that process the data where it resides, rather than extracting it and processing it elsewhere
           – A programming model that performs computations in a highly parallel and scalable environment
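The third requirement above, a programming model that runs computations in parallel next to the data, is what Map/Reduce provides. The following is a minimal single-process sketch of that paradigm, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample partitions are assumptions made for illustration, and each inner list merely stands in for a data block held on one node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def map_reduce(partitions):
    # Each partition stands in for a block stored on one node (hypothetical data).
    # On a real cluster the map calls would run in parallel, one per block.
    mapped = chain.from_iterable(
        map_phase(record) for partition in partitions for record in partition
    )
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


partitions = [["big data big impact"], ["data science", "big team"]]
counts = map_reduce(partitions)
print(counts)
```

The point of the split is that `map_phase` needs only one record and `reduce_phase` needs only one key, so both steps can be spread across many machines without moving the raw data first.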
  9. Challenges for Big Data
       • Source: http://tdwi.org/blogs/fern-halper/2013/10/four-big-data-challenges.aspx
  10. Big Data and Data Warehouse: war or peaceful coexistence?
       • The problem: different uses call for different schemas and different partitioning. In most cases the requirements are orthogonal, so no single data partitioning/indexing scheme can be optimal for everybody
       • The ideal goal: acquire and store data “as is”, then access it using multiple models. This needs a powerful artificial intelligence knowledge base and data access code generators
       • It will never be optimal for everybody without huge redundancy
       • The problems are less painful if most of the data is read anyway. Good for analytics, not good for OLTP
       • Eventually Big Data platforms will become DW platforms with well-developed access interfaces
       • Until then: acquire and store, then distribute on demand to conventional DW and data marts
  11. The New Data Architecture
       • [Architecture diagram. Labels: Data Archive; Operational Systems; Enterprise Data; Social & Clickstream; Sensor Generated; Public Data; Historical Data; Other New Sources; Big Data ODS; Hadoop (HDFS, Map/Reduce); Data Warehouse/BI/Analytics]
  12. TECHNOLOGY CHOICES
  13. The Choices for Your Data
       • RDBMS
           - High concurrency
           - TB storage
           - Indexed reads
           - Efficient updates
           - Caching
           - Highly secure
       • Analytic Appliances
           - Scalable
           - Medium concurrency
           - High volume processing (Postgres)
           - No indexes
           - TB+: Netezza (128TB/rack), Oracle (300TB/rack)
       • NoSQL
           - Highly scalable
           - High concurrency
           - Storage options
           - Updates
           - Real-time capable
           - Rudimentary indexes
           - TB+ capacity
       • Hadoop
           - Highly scalable
           - Low concurrency
           - Distributed storage
           - Complex access
           - Security (TBD)
  14. The Open Source/Big Data Landscape
       • Source: http://www.bigdata-startups.com/open-source-tools/
  15. Hadoop In Detail
       • Reference: http://blog.blazeclan.com/252/
  16. Hadoop Distributions
  17. For example, if you choose Cloudera…
  18. Comparing Hadoop Distributions
       • Source: http://www.infoworld.com/d/business-intelligence/enterprise-hadoop-big-data-processing-made-easier-184330?page=0,5
  19. Big Data’s Technical Challenges
       • Disaster recovery
       • Security
       • Data consistency
       • Workload management
       • Reprocessing
       • Troubleshooting
       • Performance
  20. DATA SCIENCE
  21. Big Data vs. BI presentation viewpoint: IMPACT
  22. Questions for BI and Big Data
       • Sample questions for BI
           – What is my sales volume by time, by region, by store, by season?
           – What is the average review rating by product category, by product? What are the dynamics of reviews, and what are the trends?
       • Sample questions for Big Data/Data Science
           – How does a change in review ratings impact sales?
           – What is the time lag between a review rating change and a sales volume change?
           – What products are purchased together, and can I improve product recommendations?
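One common way to approach the time-lag question above is a lagged correlation: shift the sales series by 0, 1, 2, ... periods and see which shift lines up best with the rating series. The sketch below is only an illustration under stated assumptions: the weekly `ratings` and `sales` figures are invented, `pearson` and `best_lag` are hypothetical helper names, and a real analysis would also need to handle seasonality and check statistical significance.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def best_lag(ratings, sales, max_lag):
    """Return (lag, correlation) for the lag where sales[t + lag] tracks ratings[t] best."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair each rating with the sales figure observed `lag` periods later.
        scores[lag] = pearson(ratings[: len(ratings) - lag], sales[lag:])
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


# Invented weekly data: sales echo the rating from two weeks earlier.
ratings = [3.0, 3.2, 4.1, 4.0, 3.1, 3.0, 4.2, 4.4, 3.5, 3.3]
sales = [310, 305, 300, 320, 410, 400, 310, 300, 420, 440]

lag, corr = best_lag(ratings, sales, max_lag=4)
print(f"best lag: {lag} weeks (correlation {corr:.2f})")
```

This is the kind of question that separates the two lists on the slide: the BI questions are answered by aggregating one series, while this one relates two series across time.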
  23. DATA SCIENCE: Data Science Skills (Science)
       • Purpose: State the problem
       • Research: Discover information about the topic
       • Hypothesis: Predict the outcome
       • Experiment: Develop a process to test the hypothesis
       • Analysis: Record the results
       • Conclusion: Compare hypothesis and results
  24. Data Science Team
       Each team would include:
       • Data Science Analyst: excellent communication skills, science and analytical background
       • Data Science Researcher/Solution Architect: good communication, good statistical/math skills, working knowledge of 2 of the following data science libraries (Mahout or any other machine learning library, RHadoop, R, SAS, SPSS)
       • Data Science Technologist: acceptable communication skills, 25% deployable to the client site (as a minimum a few should be deployable, others can be offshore), good developer, working knowledge of Big Data and related technologies
       • Data Science Presentation Engineer: knowledge of BI and presentation tools
       Nordstrom’s Big Data Team Mission: “Delighting Customers through data-driven products”
  25. USING BIG DATA
  26. Data Science Sample Use Cases
  27. Top 10 Use Cases (2013 Computerworld)
       1. Modeling Risk
       2. Customer Churn Analysis
       3. Recommendation Engines
       4. Ad Targeting
       5. POS Transaction Analysis
       6. Analysis of network data to predict future failures
       7. Threat Analysis
       8. Trade Surveillance
       9. Search Quality
       10. Data Sandbox
       Source: http://www.computerworld.com.sg/resource/storage/iiis-2013-technical-workshops/?page=2
  28. The Big Data of Dating
       • From analysis of match.com dating patterns:
       • 21+ million members
       • 100+ million hits per month
           – January 2nd is the busiest day for people to sign up on dating sites
           – Women get 60% more attention if their photo is taken indoors
           – Men get 19% more attention if theirs is taken outside
           – Full-body photos boost both sexes’ success by 203%
           – Posing with animals or your best friends might seem cute, but it actually reduces your popularity by 53 per cent (men) and 42 per cent (women)
           – Men get 8% fewer messages if they put up selfies
           – Mentions of words like divorce and separated get men 52 per cent more messages
           – Women who are more forward, using phrases like dinner, drinks or lunch in the first message, get 73 per cent more replies, while men should play it cooler. Those who mention the same words in their opening message get 35 per cent fewer replies
  29. Use Case Development
       • Business Stakeholders
       • Business Questions
       • Identify Business Value
       • Define Success Criteria
       • Develop Hypothesis and Identify Data Sources
       • Iterate results and develop data for goals
  30. Use Case Checklist
       • Title: an active description which identifies the goals of the primary actor
       • Characteristics:
           – Primary actor
           – Goal in Context
           – Scope
           – Level
           – Stakeholders and Interests
           – Precondition
       • Success criteria
           – Precondition
           – Minimal Guarantees
           – Success Guarantees
           – Trigger
           – Main Success Scenario
           – Extensions
       • Technology & Data Variations List
       • Related Information
       Reference: Alistair Cockburn
  31. EXPEDIA CASE STUDY: Archive Use Case
       • 1.5 petabytes of continuously ingested data
       • One of the largest Hadoop clusters in the world
       • 80% Open Source
       • EDW Staging and Historical Analysis
       • Call Center and Online data
       • Informatica transformation & aggregation
       • Analytic Infrastructure
       • Customer Benefits
           – Avoided massive cost of new DW infrastructure
           – Able to keep and analyze historical transactions
           – Reduce risk of DW replacement
           – Able to scale on demand using low-cost servers
       • Transaction Volume
           – > 500 GB daily increases from all sources: transaction, social, contact center
  32. Use Case: Sales Analysis. Sales per sq.ft.: Changes Over Time
       • Fitting a no-intercept line to the scatter of sales against sales floor area gives a visual baseline Sales-per-Sq.Ft. (SpSF) for each year
       • Mathematically, the SpSF measure is the slope coefficient of the trend line: 392.51 [CAD/Sq.Ft.] in 2011 vs. 373.76 [CAD/Sq.Ft.] in 2012
       • Chart labels: 417 in 2011; 417 in 2012; SpSF
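For a line forced through the origin, the least-squares slope has a closed form: the sum of (floor area × sales) divided by the sum of (floor area squared). The sketch below shows that calculation. The store areas and sales figures are invented round numbers, not the data behind the 392.51 and 373.76 values on the slide, and `sales_per_sqft` is a hypothetical helper name.

```python
def sales_per_sqft(floor_areas, sales):
    """Least-squares slope of a line through the origin: sum(x*y) / sum(x*x)."""
    numerator = sum(area * amount for area, amount in zip(floor_areas, sales))
    denominator = sum(area * area for area in floor_areas)
    return numerator / denominator


# Invented example: three stores, floor area in sq.ft., annual sales in CAD.
floor_areas = [10_000, 20_000, 30_000]
sales_2011 = [4_000_000, 8_000_000, 12_000_000]
sales_2012 = [3_800_000, 7_600_000, 11_400_000]

spsf_2011 = sales_per_sqft(floor_areas, sales_2011)
spsf_2012 = sales_per_sqft(floor_areas, sales_2012)
print(f"SpSF 2011: {spsf_2011:.2f} CAD/sq.ft., 2012: {spsf_2012:.2f} CAD/sq.ft.")
```

Dropping the intercept encodes the assumption that a store with no floor space sells nothing, which is what lets the single slope be read directly as sales per square foot.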
  33. Looking for Patterns: Anomalies
       • This chart tells us most of the stores have their highest sales on Saturday. But Store X peaks on Friday and is also doing well on Mondays. Why?
       • [Chart: sales by day of week, THU through WED; vertical axis from 0 to 10,000,000]
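The Store X observation can be found programmatically by computing each store's peak day and flagging the stores whose peak differs from the most common one. This is a minimal sketch with invented daily totals; the store names, figures, and the helper names `peak_day` and `unusual_peak_stores` are assumptions for illustration only.

```python
from collections import Counter


def peak_day(daily_sales):
    """Return the day name with the highest sales for one store."""
    return max(daily_sales, key=daily_sales.get)


def unusual_peak_stores(sales_by_store):
    """Return the most common peak day and the stores that peak on another day."""
    peaks = {store: peak_day(days) for store, days in sales_by_store.items()}
    typical_day, _ = Counter(peaks.values()).most_common(1)[0]
    outliers = {store: day for store, day in peaks.items() if day != typical_day}
    return typical_day, outliers


# Invented weekly totals per store (CAD).
sales_by_store = {
    "Store A": {"THU": 50, "FRI": 70, "SAT": 95, "SUN": 80, "MON": 40, "TUE": 38, "WED": 42},
    "Store B": {"THU": 45, "FRI": 66, "SAT": 90, "SUN": 75, "MON": 35, "TUE": 36, "WED": 40},
    "Store C": {"THU": 55, "FRI": 72, "SAT": 99, "SUN": 85, "MON": 44, "TUE": 41, "WED": 47},
    "Store X": {"THU": 52, "FRI": 98, "SAT": 80, "SUN": 60, "MON": 78, "TUE": 39, "WED": 43},
}

typical_day, outliers = unusual_peak_stores(sales_by_store)
print(f"typical peak: {typical_day}; stores to investigate: {outliers}")
```

The code only surfaces the anomaly; answering the slide's "Why?" still takes the hypothesis-and-experiment work described earlier in the talk.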
  34. Affinity Analysis Use Case
       Build a model that provides the foundation for analyzing and understanding the factors that influence year-over-year change in store performance
       • Affinity Analysis is an input to:
           – Identify products purchased in tandem
           – Provide guidance and recommendations for upsell and cross-sell
           – Redesign stores, layouts and planograms
           – Discount plans and promotions
       • Identifying customer baskets in different times and geographies
       • Investigating patterns at fine line and product levels
       • Ranking customer baskets by:
           – Number of times bought together
           – Revenue contributed
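The two ranking criteria above, number of times bought together and revenue contributed, can be computed by counting product pairs across baskets. The sketch below is one simple way to do that, not the model used in the talk: the baskets and prices are invented, and `rank_pairs` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def rank_pairs(baskets):
    """Rank product pairs by co-purchase count, breaking ties by pair revenue.

    Each basket is a dict mapping product name to the revenue from that
    product in the basket (hypothetical structure for illustration).
    """
    counts = Counter()
    revenue = Counter()
    for basket in baskets:
        # Sort names so that (a, b) and (b, a) count as the same pair.
        for first, second in combinations(sorted(basket), 2):
            counts[(first, second)] += 1
            revenue[(first, second)] += basket[first] + basket[second]
    ranked = sorted(counts, key=lambda pair: (counts[pair], revenue[pair]), reverse=True)
    return ranked, counts, revenue


# Invented baskets (revenue in CAD).
baskets = [
    {"snow scraper": 12.0, "washer fluid": 5.0},
    {"snow scraper": 12.0, "washer fluid": 5.0, "potting soil": 8.0},
    {"potting soil": 8.0, "vegetable plants": 6.0},
]

ranked, counts, revenue = rank_pairs(baskets)
top = ranked[0]
print(top, counts[top], revenue[top])
```

Pair counting is the simplest form of affinity analysis; measures such as support, confidence, and lift build on the same counts when raw frequency is dominated by best-selling products.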
  35. Clustering of Products
  36. Snow Scrapers and Washer Fluid
  37. Related Baskets
       • The size of each circle shows how often the basket has been purchased
       • Season: 2012-05-16 - 2012-08-28
       • This kind of analysis can be used for spotting driver products
       1. Potted annuals/plants, Cell-packs/annual plants
       2. Potted annuals/plants, vegetables/plants
       3. Potted annuals/plants, Outdoor soils/outdoor lawn & plant care
       4. Cell-packs/annual plants, vegetables/annual plants
  38. Big Data is Evolving
       • The industry is evolving
           – Hadoop is now 8 years old, having started in 2006 at Yahoo
           – CDH 5 recently released
           – $2.5B in venture capital in the space
           – Hadoop is now considered a standard
           – HBase is an example of a project which has not found a standard
           – Many tools today. What will there be 5 years from now?
       • How to avoid the big data pitfalls?
           – 50% of big data projects fail
           – Those who succeed drive it by focus
           – Insight vs. Impact
           – Find one problem and fix it
       • Data Science
           – Change how you do analysis… scientific methods
           – New and exciting
           – Build a hybrid team to develop data solutions
           – The team can program, knows math and statistics, and can communicate
  39. The Big Data Adventure
  40. Thank You and Questions
       • Ian Abramson, EPAM Systems
       • Toronto, Canada (GMT -5)
       • Mobile phone: +1 (416) 254-9286
       • Skype: ian.abramson
       • E-mail: Ian_Abramson@epam.com
