SoftServe BI/BigData Workshop in Utah

The Common BI/Big Data Challenges and Solutions presented by seasoned experts, Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture).
This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.

Published in: Technology, Business

    1. 1. Common BI/Big Data Challenges and Solutions By Andriy Zabavskyy & Serhiy Haziyev January, 2013
    2. 2. SoftServe BI/Big Data Lunch and Learn Workshop in Utah, January 30, 2013
        The Common BI/Big Data Challenges and Solutions presented by seasoned SoftServe experts, Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture). This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.
        About SoftServe Inc.: SoftServe, founded in 1993, is a leading global outsourced product and application development company dedicated to empowering businesses worldwide by providing end-to-end capabilities from product concept to completion. Utilizing Product Development Services 2.0 (PDS 2.0), we deliver proactive solutions in the areas of SaaS/Cloud, Mobility, BI/Analytics and UI/UX for industries including Healthcare, Retail, Manufacturing, Logistics, and Infrastructure & Storage. SoftServe is a rapidly growing global company with 3,000 professionals and offices in North America, Western Europe, Russia and Ukraine.
    3. 3. Agenda: Data Visualization, Data Mining, Big Data, Data Integration, Data Warehousing
    4. 4. Typical BI Solution (diagram)
        • Data Sources: OLTP (CRM, ERP, Finance), flat files, legacy system, Big Data
        • Data Integration: ETL/ELT
        • Data Warehouse: data warehouse, OLAP cubes
        • Data Mining: predictive and prescriptive analytics
        • Data Visualization and Analysis: reports, dashboards, spreadsheets, BI tools
        • Consumers: users, analysts
    5. 5. Agenda: Data Visualization, Data Mining, Big Data, Data Integration, Data Warehousing
    6. 6. Dashboard & Scorecard
        Problem:
        • Single view from multiple sources
        • Track performance against company targets
        Solution:
        • Dashboard
        • KPI and Scorecards
        Diagram labels: Client, Internet, Server Tier
    7. 7. Dashboard & Scorecard: Implementation
        Software vendors' offerings:
        • Boxed solutions from big players (e.g. SAS, SAP, IBI): customization
        • Dashboard frameworks (e.g. Tableau, QlikView, JasperSoft): custom-defined KPIs
        • Dashboard libs (JIDE libs): custom-defined KPIs and a custom-built dashboard framework
        Diagram axes: Development Efforts, Integration Efforts
    8. 8. Dashboard & Scorecard: Highlights • Adopting or customizing a ready-made line-of-business solution can be a painful, long and costly process • Not all dashboard solutions support multitenancy out of the box
    9. 9. Self Service BI
        Problem:
        • Give BI users the ability to explore and analyze data in a highly customizable manner
        Solution:
        • Expose a data model to users
        • Provide a toolset with data exploration and analysis capabilities
        Diagram labels: BI Users, Toolset, Data Model, OLAP, In-Memory, RDBMS/NoSQL
    10. 10. Self Service BI: Implementation • OLAP engines with proper OLAP viewers • BI tools with in-memory engines and semantic/domain layers • Report Authoring Tools : – Microsoft Report Builder – JasperServer Report Designer
    11. 11. Self Service BI: Traditional vs Agile BI
        Trade-off chart comparing Traditional and Agile BI on: features, time to value, self service, collaboration, interactivity and UX, customization, data quality, pixel-perfect output, low-cost solutions
    12. 12. Self Service BI: Highlights • Data consumers need to be educated to use SSBI tools properly • Many SSBI vendors' desktop tools are more mature than their web tools • In-memory capabilities are limited by RAM size
    13. 13. Agenda: Data Visualization, Data Mining, Big Data, Data Integration, Data Warehousing
    14. 14. Data Integration Patterns
        Patterns: ETL, ELT, Replication, EAI, EII
        Diagram axes: Scheduled vs. Real-time; Message/Record based vs. Large data sets
        Source: Microsoft EDW Architecture, Guidance and Deployment Best Practices
    15. 15. ELT
        Problem:
        • Efficiently processing very large volumes of data within ever-shortening processing windows
        Solution:
        • Perform transformation steps on the target platform
        • Set-based processing
        Diagram labels: Source, Extract, Staging Layer, Load, Transform, Data Warehouse, Semantic Layer
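The ELT idea on this slide (load first, then transform on the target platform with set-based statements) can be sketched with Python's built-in `sqlite3` module standing in for the target platform. This is a minimal sketch, not the presenters' implementation: the table names `staging_sales` and `dw_sales`, the column names, and the cleansing rules are all assumptions made for illustration.

```python
import sqlite3


def run_elt(raw_rows):
    """Load raw rows into a staging table, then transform inside the database."""
    con = sqlite3.connect(":memory:")
    # Hypothetical staging and warehouse tables (names are illustrative only).
    con.execute("CREATE TABLE staging_sales (customer TEXT, amount_cents TEXT)")
    con.execute("CREATE TABLE dw_sales (customer TEXT PRIMARY KEY, total_amount REAL)")

    # Extract + Load: copy the source records as-is, with no per-record logic.
    con.executemany("INSERT INTO staging_sales VALUES (?, ?)", raw_rows)

    # Transform: a single set-based statement executed by the target platform.
    con.execute(
        """
        INSERT INTO dw_sales (customer, total_amount)
        SELECT UPPER(TRIM(customer)),
               SUM(CAST(amount_cents AS INTEGER)) / 100.0
        FROM staging_sales
        GROUP BY UPPER(TRIM(customer))
        """
    )
    rows = con.execute(
        "SELECT customer, total_amount FROM dw_sales ORDER BY customer"
    ).fetchall()
    con.close()
    return rows
```

Calling `run_elt([(" acme ", "1050"), ("Acme", "250")])` leaves the cleansing and aggregation to the SQL engine rather than to a record-by-record pipeline in the integration tool.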
    16. 16. ELT: Highlights • Some data integration platforms have clearly separated ETL and ELT components • Consider usage of custom scripts native to target platform vs. built-in DI component
    17. 17. ETL vs. ELT
        ETL
        • Flow: data pipelines are used; transformations are applied to the data one record at a time; intermediate data results are stored in memory
        • Advantages: complex transformations; keeping intermediate results in memory is faster than persisting them to disk
        • Disadvantages: large data sets could overwhelm the memory
        ELT
        • Flow: data is loaded into the destination server; set-based processing; transformations and lookups are within the SQL
        • Advantages: the power of the relational database system can be utilized for very large data sets; updates are more efficient using set-based processing
        • Disadvantages: load on RDBMS; more disk activity
    18. 18. Agenda: Data Visualization, Data Mining, Big Data, Data Integration, Data Warehousing
    19. 19. Kimball’s Multidimensional EDW
        Problem:
        • Integrate and consolidate data from heterogeneous sources
        • Keep data history
        Solution:
        • Use multidimensional model to store data
        • Iterate by business lines
        • Integrate by conformed dimensions
        Diagram labels: Data Sources, Data Warehouse
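A conformed dimension is one dimension table shared by the fact tables of several business lines, which is what lets their measures be combined. The sketch below illustrates this with `sqlite3`; the schema (`dim_customer`, `fact_sales`, `fact_support`) and the sample rows are invented for the example and do not come from the slides.

```python
import sqlite3


def build_star_schema():
    """Create two fact tables that share one conformed customer dimension."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
        -- One fact table per business line, both keyed by the same dimension.
        CREATE TABLE fact_sales (customer_key INTEGER REFERENCES dim_customer, amount REAL);
        CREATE TABLE fact_support (customer_key INTEGER REFERENCES dim_customer, tickets INTEGER);
        INSERT INTO dim_customer VALUES (1, 'Acme', 'West'), (2, 'Globex', 'East');
        INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 70.0);
        INSERT INTO fact_support VALUES (1, 2), (2, 5);
        """
    )
    return con


def sales_and_tickets_by_region(con):
    """Aggregate each fact table separately, then join through the shared dimension."""
    query = """
        SELECT d.region, SUM(s.total) AS sales, SUM(t.total) AS tickets
        FROM dim_customer AS d
        LEFT JOIN (SELECT customer_key, SUM(amount) AS total
                   FROM fact_sales GROUP BY customer_key) AS s
               ON s.customer_key = d.customer_key
        LEFT JOIN (SELECT customer_key, SUM(tickets) AS total
                   FROM fact_support GROUP BY customer_key) AS t
               ON t.customer_key = d.customer_key
        GROUP BY d.region
        ORDER BY d.region
    """
    return con.execute(query).fetchall()
```

Each fact table is aggregated before the join so that rows from one business line do not multiply rows from the other.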
    20. 20. Kimball vs. Inmon
        Diagram labels: Sources, Data Integration and Data Warehousing, 3NF, Inmon Approach, Kimball Approach, Visualization
    21. 21. Kimball vs. Inmon
        • Overall approach: Inmon is top-down; Kimball is bottom-up
        • Data orientation: Inmon is subject- or data-driven; Kimball is process-oriented
        • Data modeling: Inmon is traditional; Kimball is multidimensional
        • Primary audience: Inmon targets IT professionals; Kimball targets end users
    22. 22. DWH: Implementation • Traditional RDBMS • Analytical Column-based RDBMS
    23. 23. DWH: Highlights Implications of column-based storage: – Additional columns vs. junk dimensions – Update scenarios should be avoided where possible – The partitioning scheme should be carefully designed to support maintenance activities
    24. 24. Agenda: Data Visualization, Data Mining, Big Data, Data Integration, Data Warehousing
    25. 25. Big Data Big Data axis
    26. 26. Big Data: Hybrid Approach
        Problem:
        • Under big data circumstances, provide:
            – Flexible online analytics
            – Access to the most detailed raw data
        Solution:
        • Analytical RDBMS for online analytics
        • NoSQL DB as the source for the RDBMS and as the store for the most detailed raw data
        Diagram labels: Source, NoSQL, RDBMS/DW, Operational and Historical Analytics
    27. 27. Big Data: Implementation Sample of Hybrid Approach in HP Operational Analytics Architecture
    28. 28. Big Data isn’t only Hadoop
        Storage options compared (values listed in column order): Tape Library, HDFS, Disk Array, Amazon S3, Amazon Glacier
        • Throughput (600 GB load time): 140-500 MB/s (0.3-1.2 h) | 10-30 MB/s (5.5-16 h) | 50-700 MB/s (0.25-4 h) | 2-40 MB/s (83 h)
        • Max capacity: 30-900 PB | 21+ PB | 16 PB | ~Unlimited
        • Max file size: ~Unlimited | ~Unlimited | 4-16 TB (OS-limited) | 5 TB | 40 TB
        • Accessibility: SAN | Java API, HTTP, NFS (MapR) | NFS, CIFS, SAN | REST, SOAP
        • Scalability: Adding cartridges | Adding nodes | Adding disks | Pay-as-you-go
        • Reliability: Redundancy | Redundancy (MapR) | Redundancy | 99.99%
        • Encryption: Yes | Yes* | Yes* | Yes
        • HIPAA compliance: By datacenter | By datacenter | By datacenter | By Amazon | ?
        • Random access: No | Yes | Yes | Yes | No
        • Parallel processing: No | Yes | Yes** | No
        • 100 TB cost: $40-60K | $100-200K | $80-400K | $132-216K/year | $12-96K/year
        • 1 PB cost: $90-140K | $1-2M | $0.5-4M | $1.1-1.6M/year | $120-360K/year
        • 15 PB cost: $0.7-1.2M | $15-30M | ~$18M | $9.9-15M/year | $1.8-3.5M/year
        Other slide labels: Requirements, Retention Storage, Operation Storage
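The load times quoted next to each throughput figure appear to follow from dividing the 600 GB data size by the sustained throughput. A small sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB); the function name is illustrative:

```python
def load_time_hours(size_gb, throughput_mb_per_s):
    """Hours needed to move size_gb at a sustained throughput (assumes 1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


# 600 GB at 500 MB/s takes about a third of an hour; at 2 MB/s it takes about 83 hours.
fast = load_time_hours(600, 500)
slow = load_time_hours(600, 2)
```

Under this assumption the results line up with the slide's 0.3-1.2 h range for 140-500 MB/s and with 83 h for 2 MB/s.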
    29. 29. Big Data: Highlights • Clickstream analysis is a classic use case • Scheduled reports are well suited to Hadoop-based reporting • The majority of Self Service BI tools need a relational representation of the data
    30. 30. Agenda: Data Visualization, Data Mining, Big Data, Data Integration, Data Warehousing
    31. 31. Prediction of Customer Loyalty
        Problem:
        • Predict customer loyalty; profitability
        Solution:
        • Logistic regression algorithm
        • Support vector machines
        Diagram labels: Historical Data, Algorithm, DM Tool, Prediction
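To make the logistic regression option concrete, here is a minimal pure-Python sketch trained with batch gradient descent. It is an illustration only: the features (orders per year, complaints), the learning rate, and the epoch count are assumptions, and a production model would more likely come from a statistical package or a ready-made data mining tool.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for one historical customer record.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_loyalty(weights, bias, x):
    """Return the estimated probability that a customer is loyal."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

The model is trained on historical records labeled 1 (loyal) or 0 (not loyal), and `predict_loyalty` then scores new customers with a probability that can be thresholded or ranked.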
    32. 32. Recommendation System
        Problem:
        • Recommend to customers the most suitable goods
        Solution:
        • K-means clustering algorithm
        • Collaborative filtering
        Diagram labels: Historical Data, Algorithm, DM Tool, Recommendation
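Collaborative filtering can be sketched in a few lines: find customers with similar rating histories and suggest the goods they rated that the target customer has not seen. The version below is a simple user-based variant using cosine similarity; the data layout (a dict of per-user rating dicts) and the scoring rule are assumptions chosen for brevity, not the approach used in the workshop.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

Users with no overlapping purchases contribute nothing, so recommendations come only from customers whose history resembles the target's.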
    33. 33. DM Models: Implementation • Custom algorithm implementation • Statistical packages like R • Ready data mining model implementations
    34. 34. DM Models: Highlights • The approach should be: Problem -> Data Strategy -> Data analysis … and not vice versa • DM algorithms should be carefully selected • DM algorithms are highly dependent on the business domain you create them for
    35. 35. SoftServe BI Maturity Model
        • Wisdom: improving the business
            – decision making (executives)
            – data mining, forecasting
        • Knowledge: gaining business insight
            – analytical reports (analysts)
            – dashboards, KPIs, scorecards, slice & dice, data warehouse, OLAP
        • Information: measuring and monitoring
            – consolidated reports (managers)
            – charts, parametrized reports, dedicated reporting database
        • Data: running the business
            – personal operational reports (workers, customers)
            – simple reports, OLTP or files
    36. 36. SoftServe BI/BigData Expertise Big Data and NoSQL Data Integration Data Warehouse BI Platforms
    37. 37. More Info about SoftServe BI Offerings  http://www.softserveinc.com/en-us/services/software-architecture/  http://www.softserveinc.com/en-us/services/bi-analytics/
