Accelerating Data Science and Real Time Analytics at Scale

Gaining business advantage from big data is moving beyond efficient storage and deep analytics on diverse data sources. It now means applying AI methods and analytics to streaming data to catch insights and take action at the edge of the network.

https://hortonworks.com/webinar/accelerating-data-science-real-time-analytics-scale/

Published in: Technology

  1. Accelerating Data Science and Real-Time Analytics at Scale
     Nadeem Asghar, Hortonworks, Field CTO and Global Head Partner Engineering
     Steve Roberts, IBM, Big Data Offering Manager
  2. Enterprise Amnesia
     [Chart: data over time, available data vs. understood data]
     • 80 million wearable health devices will be available by 2017.
     • 2.5 quintillion bytes of data generated daily by connected machines.
     • There will be 28 times more sensor-enabled devices than people by the year 2020.
     • 25 gigabytes of data per hour is generated by a connected car.
     • 90% of cars will be connected by 2020.
     • 153 exabytes of healthcare data generated by devices in 2013, increasing to 2,314 exabytes in 2020.
     • 1.7 megabytes of data per second generated by every human being on the planet by 2020.
  3. A New Era of Computing Has Emerged
     [Timeline figure. Labels: Centralized Mainframes, Personal Computer, Client/Server, Office Productivity, Distributed Computing, E-Business, Data Warehousing, Smarter Planet, Big Data & Predictive Analytics, Cloud, Cognitive Era]
     [Second figure: Data, Insight, Context. Labels: Transactional Database, Reporting, Business Intelligence, Big Data & Analytics, Actionable Insight in Context, Cognitive]
  4. From Data to AI: Intelligent Job Matching
     © 2018 IBM Corporation
     A recruiting and HR company chose an IBM and Hortonworks full-stack solution to support its Hadoop/Spark workloads and accelerate its analytics and AI projects.
     Business problem
     Job matching is the company's core business, and the accuracy and speed of this matching are critical to its success. It requires the intake and analysis of terabytes of data daily, including recruiter and company information, job listings, hiring histories, and resumes. A future requirement is to apply AI to more complex data such as images, sound and video.
     Benefits
     • Proven performance
     • World-class support
     • Reliable security for personal data
     • Built on open technologies, avoiding vendor lock-in
     • Scalable software-defined storage proven for analytics
     • POWER9 and PowerAI support their AI research and development
  5. AI at the Edge
     [Infographic; surviving labels: accident risk rate, 90%, inspection times, 10X, number of inspections]
  6. IBM + Hortonworks = Unlocking Actionable Insights
     © Hortonworks Inc. 2011 – 2016. All Rights Reserved
     • #1 Pure Open Source Hadoop Distribution
     • 1000+ customers and 2100+ ecosystem partners
     • Employs the original architects, developers and operators of Hadoop from Yahoo!
     • Best-in-class 24x7 customer support
     • Leading professional services and training
     • Data Science Leader
     • OpenPOWER performance leadership
     • Flexible, software-defined storage
     • #1 SQL Engine for complex, analytical workloads
     • Leader in On-premise and Hybrid Cloud solutions
  7. DATA – More Volume and More Types
     [Figure. Axis: increasing data variety and complexity. Volume scale: gigabytes, terabytes, petabytes, exabytes. Tiers: ERP, CRM, Web, Big Data.]
     Data types shown: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, speech to text, product/service logs, social network, business data feeds, user click stream, web logs, offer history, dynamic pricing, A/B testing, affiliate networks, search marketing, behavioral targeting, dynamic funnels, payment record, support contacts, customer touches, purchase detail, purchase record, segmentation, offer details.
  8. Business Analytics Must Evolve To Deal With Data Tipping Point
     • PROVIDE INSIGHT INTO THE PAST via data aggregation, data mining, business reporting, OLAP, visualization, dashboards, etc.
     • UNDERSTAND THE FUTURE via statistical models, forecasting techniques, machine learning, etc.
     • ADVISE ON POSSIBLE OUTCOMES via rules, optimization and simulation algorithms
  9. Data Science and Real-Time Analytics at Scale: End-to-End Data Science Workflow
     • Data Engineering: discovery, acquisition, processing, curation
     • Data Science: data wrangling; feature engineering, visualization and analysis; model building, training and testing
     • Deployment & Operationalize: reports, dashboards, real-time scoring, batch scoring, REST services, performance management, scheduling
     Supporting platforms:
     • Data Science Experience (DSX) Enterprise Services: multi-notebook support, versioning, collaboration, model management
     • Hortonworks Data Platform (HDP) Enterprise Services: data, GPU, deep learning, compute, security, governance, metadata, operations
     • Hortonworks Data Flow (HDF) Enterprise Services: data ingestion, schema registry, CEP
  10. Use Case Deep Dive: Credit Card Fraud Prevention
  11. Building a Model
      • Show of hands, how many have built a “Model”?
      • What are some limitations?
        – Condition-based logic: if/else binary decisions
      • If you need a lot of data to build a good model, what tools can you use?
        – Data volumes can eliminate the possibility of desktop tools
      • Sampling?
        – Well... we had better get an even distribution of true and false positives in each sample, but wait, that requires data munging, which brings us back to what tools we can use.
      • Security concerns?
        – Extracting data from its secure resting place and pushing it into other environments, oftentimes unsecured files or desktops where Matlab or R can be installed.
      • Collaboration
        – Push processing to the data using modern distributed tooling.
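The sampling concern on this slide can be illustrated with a small sketch. It assumes labeled transactions held as dictionaries with a hypothetical `is_fraud` field, and it samples each label group separately so that the rare fraud class is not lost. The function name, field name and data are illustrative assumptions, not part of the original deck, and on a real cluster the same idea would run in a distributed engine rather than in plain Python.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_key, fraction, seed=0):
    """Sample `fraction` of the records from each label group separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[label_key]].append(rec)
    sample = []
    for members in groups.values():
        # Keep at least one record per class so rare labels survive sampling.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 10 fraudulent and 990 legitimate transactions.
transactions = [{"id": i, "is_fraud": i < 10} for i in range(1000)]
subset = stratified_sample(transactions, "is_fraud", fraction=0.1)
fraud_count = sum(rec["is_fraud"] for rec in subset)
```

A plain random 10% sample of the same data could easily contain no fraud at all; sampling per group keeps the class proportions stable from sample to sample.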
  12. Credit Card Fraud Use Case
      • Requirement: Detect fraudulent transactions.
      • Goal: Save the card company money and build trust amongst card users. Cut down on fraudulent crime.
      • Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.
      • Design
        – Distance: How far can one travel over a period of time before it is fraudulent?
        – Category: How can we detect a purchase that a customer wouldn’t likely make?
        – Frequency: How can we detect purchasing patterns that do not resemble the card holder?
      • Ideas?
        – Whiteboard some conditional logic, egregiousness vs binary
        – Back test the data
        – Build a model per card holder?
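The "Distance" design question can be sketched as an implied-speed check between two consecutive purchases on the same card. This is only one possible reading of the slide: the record fields (`lat`, `lon`, `ts` in seconds) and the 900 km/h threshold are assumptions made for illustration, not values from the deck.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumed threshold, roughly airliner cruising speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def implied_speed_kmh(prev, curr):
    """Speed needed to get from the previous purchase to the current one."""
    distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    hours = (curr["ts"] - prev["ts"]) / 3600.0
    if hours <= 0:
        return math.inf if distance > 0 else 0.0
    return distance / hours

def is_distance_suspicious(prev, curr, max_kmh=MAX_PLAUSIBLE_KMH):
    return implied_speed_kmh(prev, curr) > max_kmh

# Hypothetical purchases: New York, then London one hour or one day later.
new_york = {"lat": 40.7128, "lon": -74.0060, "ts": 0}
london_1h = {"lat": 51.5074, "lon": -0.1278, "ts": 3600}
london_24h = {"lat": 51.5074, "lon": -0.1278, "ts": 24 * 3600}
```

Here `is_distance_suspicious(new_york, london_1h)` is true and `is_distance_suspicious(new_york, london_24h)` is false. Returning the implied speed rather than a bare yes/no also supports the "egregiousness vs binary" idea, since the speed can feed a score instead of a hard rule.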
  13. Rules, Statistics, Machine Learning
      • Rule-Based Logic
        – Great for checking conditions that can prove to be 100% accurate. Easy to build and no reason to over-engineer.
        – Example: Spending limit. Card holder limit = $2,000
          • If (currentPurchaseAmount + balance > 2,000) then deny transaction
      • Statistics
        – Mean, median, mode, variance, deviation
        – Anomaly detection. Outliers. (e.g., women's retail example)
      • Machine Learning
        – Supervised
        – Unsupervised
        – Trainable
        – Adapt over time
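The first two approaches on this slide can be sketched in a few lines. The spending-limit rule follows the slide's own example; the z-score outlier test is one common statistical check, chosen here as an assumption since the deck does not say which statistic the demo uses. The purchase history and the threshold of 3 standard deviations are illustrative.

```python
from statistics import mean, stdev

CARD_LIMIT = 2000.00  # card holder limit from the slide's example

def exceeds_limit(current_purchase_amount, balance, limit=CARD_LIMIT):
    """Rule-based check: deny when the purchase would push the balance over the limit."""
    return current_purchase_amount + balance > limit

def z_score(amount, history):
    """Number of sample standard deviations between `amount` and the historical mean."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical amounts
    if sigma == 0:
        return 0.0
    return (amount - mu) / sigma

def is_outlier(amount, history, threshold=3.0):
    """Statistical check: flag amounts far from the card holder's usual spending."""
    return abs(z_score(amount, history)) > threshold

# Hypothetical purchase history for one card holder, in dollars.
history = [45.00, 8.50, 88.00, 55.00]
```

With this history, a $60 purchase is not flagged while a $900 purchase is, even though neither would break the $2,000 limit on an empty balance. The rule gives a binary answer; the z-score gives a degree of unusualness that can be tuned or combined with other features.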
  14. Discovery
      • Gathered all credit card transactions
        – Problem is they didn’t make sense
        – No identifiable patterns, no log-normal curves
        – Gas $45, Chipotle $8.50, steak dinner $88, Amazon shoes $55
      • Classification
  15. Outlier Detection: identify abnormal patterns
      Example: identify anomalies
      Features:
      - Time frequency
      - Category
      - Amount
      - Distance
  16. Fraud Detection Demo: Technical Architecture
      Data in motion:
      • Real-Time Data Movement (Apache NiFi)
      • Inbound Messaging (Kafka)
      • Real-Time Processing (Storm)
      Other components:
      • Distributed Storage: HDFS
      • Many Workloads: YARN
      • Real-Time Serving (HBase)
      • Spark (Machine Learning)
      • UI and HTTP PubSub (Jetty and Tomcat)
      • Data Science (DSX)
      • Resource Allocation (Docker)
      • Interactive Query (Hive)
      • Authorization (Ranger)
      • Governance (Atlas)
      All running on top of IBM Power hardware
  17. Use Case Demo: Credit Card Fraud Prevention
  18. Credit Fraud Analyst Inbox
  19. Credit Fraud Analyst Investigation
  20. Credit Fraud Analyst Action
  21. Hortonworks Data Flow: Backbone for Bi-Directional Communication
  22. Demo Summary: Problems Solved
      • Data science teams can collaborate and learn new tools on a common framework.
      • Choice of open source tools, notebooks, and languages.
      • Run favorite notebook on all data in their HDP cluster.
      • Deploy the model to production.
      • Leverage the production model to deliver insights to business.
      • Monitor the health and performance of models in production.
  23. Journey to Fraud Detection
      Proactively identify potential fraudulent transactions to protect the customer and improve customer experience
      • Proactively monitor every credit card transaction using machine learning to catch potential fraud
      • Customer Service Analyst reviews flagged transactions in real time via a next generation application running on the connected platform
      • HDF controls real time flow of data in and out of the connected platform to the various source and destination points
      [Figure labels: Innovate, Renovate, Improved Experience / Reduced Cost, Immediate Customer Feedback, Years of Customer Transaction Data, Fraud Detection, Complete Customer Profile, Real time ingest of transactions, Purchase Behavior Insight]
  24. Data Science Solution: Data Science Experience
      Community
      • Find tutorials and datasets
      • Connect with Data Scientists
      • Ask questions
      • Read articles and papers
      • Fork and share projects
      Open Source
      • Code in Scala/Python/R/SQL
      • Zeppelin & Jupyter Notebooks
      • RStudio IDE and Shiny
      • Apache Spark
      • Your favorite libraries
      Scale & Enterprise Security
      • Data Science at Scale
      • Run Spark Jobs on HDP Cluster
      • Secure Hadoop Support
      • Ranger Atlas Support for Data
      • Support for ABAC
      Model Management
      • Data Shaping Pipeline UI
      • Auto-data preparation & modeling
      • Advanced Visualizations
      • Model management & deployment
      • Documented Model APIs
      Freedom: Choose the right tool for your team and business.
      Productivity: Make both experienced and novice data scientists more productive.
      Trust: Confidently deploy insights generated from the most current data and trends.
  25. [Slide text fragments] enterprise-ready software distribution built on open source tools for ease of development performance faster training times for data scientists
  26. IBM Power Systems: designed to deliver breakthrough performance for data
      [Infographic; numeric multipliers lost in extraction]
      • More threads per core vs. x86 means: get more work done
      • More processor cache (L1 ↔ L4) vs. x86 means: fastest memory lives on cores
      • More memory bandwidth vs. x86 means: more data than ever is flowing
      • Open innovation and a better community mean: faster innovation and value
      availability | scalability | reliability | serviceability
  27. Accelerate Data Science with Power Systems
      • Increase Data Science Team productivity
      • Reduce model training time
        – 2.5X with S822LC for HPC vs E5-2640 v4 (with SSD)
        – 1.5X with S822LC for Big Data vs E5-2699 v4 (with HDD)
      • Leverage larger datasets for model training
        – 2.5X larger dataset in the same time (1200 seconds: ~5 GB for x86 server E5-2640 with SSD vs 13 GB for Power server S822LC HPC with SSD)
      [Chart: elapsed time (seconds, 0 to 3600) vs. data size (GB, 0 to 16) to form 5 clusters in 100 iterations using k-means clustering with one user. Series: S822LC HPC with SSD, S822LC Big Data with HDD, E5-2699 v4 with HDD, E5-2640 v4 with SSD]
      Test results are based on running a machine learning workload using the k-means clustering algorithm on data sets ranging in size from 1 GB to 15 GB.
      Test system details:
      • Power Systems S822LC HPC: 20 cores, 512 GB RAM, SSD
      • Power Systems S822LC Big Data: 20 cores, 512 GB, HDDs
      • Intel server with Broadwell E5-2640 v4: 20 cores, 512 GB, SSD
      • Intel server with Broadwell E5-2699 v4: 44 cores, 512 GB, HDD
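The benchmark above times k-means clustering with 5 clusters and 100 iterations. For readers unfamiliar with the workload, here is a minimal single-machine sketch of the standard Lloyd's algorithm in plain Python. It is not the benchmarked code, whose implementation the deck does not describe; the function names, the synthetic data, and the early stop on convergence are assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct input points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged before the iteration cap
        centroids = new_centroids
    return centroids, assignments

# Synthetic 2-D data scattered around five hypothetical centers.
data_rng = random.Random(42)
centers = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
points = [(cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
          for cx, cy in centers for _ in range(40)]
centroids, assignments = kmeans(points, k=5, iterations=100)
```

Each iteration scans every point against every centroid, so the cost grows with data size, which is why elapsed time against gigabytes of input is a natural axis for the chart.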
  28. The Perfect Blend of Data Science and an Enterprise Data Lake
      Better Together
      For clients building a high-performing Data Science practice with a fast, scalable, enterprise Data Lake: a complete solution of Data Science and Hadoop software, hardware and quick start services.
      • Boost Data Science Team Productivity: model training in less than half the time versus x86
      • Blazing Fast Insights for Line of Business: a 1.7x improvement in time to result
      • Secure and Reliable Data Access at Scale: open, comprehensive data lifecycle and security management on the most reliable servers
      datascience.ibm.com
  29. Hortonworks Preconfigured Images available on IBM POWER8
      © 2016 IBM Corporation
      1. Go to IBM Power Development Cloud (PDC): Link
      2. Follow the Get Started process via the “Go to Program to Get Started” link and register for IBM PDC as a Partner or Open Source Developer.
      3. When you reach the IBM PDC “Make a Reservation” page, click Request promo code.
      4. Select Red Hat Linux for the Image Category. Enter the vCPUs and memory using values from the size/flavor options in the table below. In the Other requirements field, enter one of the Image names from the table below. Click Submit.
      5. Wait for an approval email. Then, follow the instructions in the Create Reservation guide to complete your reservation.
      6. On the reservations page, select the company profile that shows VMaaS, enter the Promo code received in the email, and click Apply.
      7. In the next form, select the desired Flavor and Image name.

      Image Name                                 Software Versions       Linux Version
      HDP 2.6.2                                  HDP 2.6.2               RHEL 7.3
      HDP 2.6.4                                  HDP 2.6.4               RHEL 7.4
      HDP/HDF Security Governance Demo           HDP 2.6.3, HDF 3.0.3    RHEL 7.4
      HDP/HDF Credit Card Fraud Detection Demo   HDP 2.6.3, HDF 3.0.3    RHEL 7.4
      HDP/HDF IOT Trucking Demo                  HDP 2.6.3, HDF 3.0.3    RHEL 7.4

      Size Flavor Options   Description
      Small                 8 vCPUs, 24GB memory, 50GB disk
      Medium                16 vCPUs, 32GB memory, 200GB disk
      Large                 24 vCPUs, 48GB memory, 500GB disk
  30. How to Get Started with Hortonworks on OpenPOWER Systems
      • Learn more about the benefits of IBM Power Systems and OpenPOWER
      • Join the Hortonworks Community: https://community.hortonworks.com/
      • Learn more about the benefits of Hortonworks: http://hortonworks.com/training/
      • Sign up for free Data Science and Cognitive Computing courses: https://cognitiveclass.ai/
      • Try the solution: IBM benchmark centers, on the cloud or on your premises
  31. Q&A
      IBM Cloud / DOC ID / Month XX, 2017 / © 2017 IBM Corporation
  32. Thank you
