Big Data Scotland 2017

Big Data & Analytics continues to redefine business. Data has transitioned from an underused asset to the lifeblood of the organisation, and a critical component of business intelligence, insight and strategy.

Big Data Scotland is the largest annual data analytics conference held in Scotland: it is supported by ScotlandIS and The Data Lab and free for delegates to attend. The conference is geared towards senior technologists and business leaders and aims to provide a unique forum for knowledge exchange, discussion and cross-pollination.

The programme will explore the evolution of data analytics; looking at key tools and techniques and how these can be applied to deliver practical insight and value. Presentations will span a wide array of topics from Data Wrangling and Visualisation to AI, Chatbots and Industry 4.0.

Key Topics

• Tools and techniques
• Corporate data culture, business processes, digital transformation
• Business intelligence, trends, decision making
• AI, Real-time Analytics, IoT, Industry 4.0, Robotics
• Security, regulation, privacy, consent, anonymization
• Data visualisation, interpretation and communication
• CRM and Personalisation

Published in: Technology

Big Data Scotland 2017

  1. 1. Welcome to Big Data Scotland 2017 #datascot
  2. 2. Mark Stephen BBC Scotland @bbcscotland #scotdata
  3. 3. Ray Bugg DIGIT @digitfyi #scotdata
  4. 4. www.digit.fyi 50,000 Monthly Page Views 30,000 Unique Visitors Monthly News, Views, Opinion, Insight
  5. 5. Our Next Event DT2018 3rd Annual Digital Transformation Conference www.digifutures.co.uk
  6. 6. Kate Goldman KBG Solutions @digitfyi #scotdata
  7. 7. New Foundations for a Data-Driven Organisation
  8. 8. 2018: A Confluence of Factors
  9. 9. The Time is Now • The shift to data-driven processes is underway. • Burgeoning data availability, open source, more affordable technology, and consumer demand for betterment on every front suggest it might be prime time for analytics teams. • New and advanced analytics technologies are on the rise: predictive and prescriptive analytics, decision management software, smart machines, event stream processing applications and operational intelligence platforms. • As a result, the demand for analytics has moved outside specialty IT-based communities and is now being led largely by business units as well.
  10. 10. Where’s the Beef? • Building Blocks • Fundamentally, build out the underlying data architecture as well as data collection or generation capabilities. • Switch from legacy data systems to a more nimble and flexible architecture to store and harness big data. • Digitize operations more fully in order to capture more data from customer interactions, supply chains, equipment, and internal processes. • But… Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom. • Secondary (but no less important) is to find, in your data, the answers to the innumerable business opportunities your organisation faces.
  11. 11. So Here is What you Need to Do. Now.
  12. 12. 5. Silo Busting to Let the Data Flow • Silos, created because of structural issues, political factors, growth challenges, or vendor lock-in, can create significant challenges to creating insight and advantage from data. • The pressure is on, but incremental, practical, methodical approaches are best for those without the luxury of building from scratch. • Find a good target. • Use engagement practices to identify high-value opportunities. • Analyze business needs, choose a problem where data could provide a tangible benefit and value… perhaps in enhancing sales or preemptive incident response. • Draw in the data from around the organization and invest in these use cases first. This is not a proof of concept (you should do these earlier as a way of identifying opportunities) but a banner project that can drive subsequent investments. • Tie the integration to its application, so you get value early. • Every progressive step moves towards integration. (But beware the planning fallacy!)
  13. 13. Dan Fiehn Markerstudy Group @danfiehn #scotdata
  14. 14. DATA INNOVATION The Gap Smart Automation Smart Learning Smart Devices
  15. 15. THE GAP
  16. 16. Device Obsession The Gap Smart Automation Smart Learning Smart Devices
  17. 17. Exceptional Experience The Gap Smart Automation Smart Learning Smart Devices
  18. 18. Connect Anywhere The Gap Smart Automation Smart Learning Smart Devices
  19. 19. Systems Unable to Cope The Gap Smart Automation Smart Learning Smart Devices
  20. 20. Capability to Change The Gap Smart Automation Smart Learning Smart Devices
  21. 21. SMART AUTOMATION
  22. 22. Predicting Issues The Gap Smart Automation Smart Learning Smart Devices
  23. 23. Intelligent Services The Gap Smart Automation Smart Learning Smart Devices
  24. 24. The Gap Smart Automation Smart Learning Smart Devices Autonomous Teams
  25. 25. SMART LEARNING
  26. 26. Award Winning AUTOMATION The Gap Smart Automation Smart Learning Smart Devices
  27. 27. The Gap Smart Automation Smart Learning Smart Devices
  28. 28. +34% Machine Beats Man The Gap Smart Automation Smart Learning Smart Devices
  29. 29. 1. Model Production 2. Data Enrichment 3. Scenario Simulations 4. Model Deployment 5. AI Aggregation 6. Actionable Insights Intelligence Engine The Gap Smart Automation Smart Learning Smart Devices
  30. 30. IncreaseReduction Customer Retention Sweet Spot The Gap Smart Automation Smart Learning Smart Devices
  31. 31. Insurance in a TRiCE The Gap Smart Automation Smart Learning Smart Devices
  32. 32. SMART DEVICES
  33. 33. For the first time in history, greater insight into our driver behaviour. The Gap Smart Automation Smart Learning Smart Devices
  34. 34. ARE YOU DRIVING DATA INNOVATION?
  35. 35. DATAINNOVATION uk.linkedin.com/in/danfiehn @danfiehn No algorithms were hurt in the m GAP AUTOMATION LEARNING DEVICES
  36. 36. Malachy Devlin Clyde Space @ClydeSpace #scotdata
  37. 37. Advancing your Data Analytics Capabilities Malachy Devlin Big Data - Edinburgh | 7th December 2017
  38. 38. Company Journey – Initial Data [Charts: Company Revenue and Company Growth, May-16 to Mar-23]
  39. 39. Data Query Language - Relational Data EQL
  40. 40. Relational Data SQL 1. Migrate to Databases • Data 2. Migrate to MRP/ERP • Data • Business Logic Manufacture Customers Finance Quality HR Supply Chain Management
  41. 41. Manufacturing/Enterprise Resource Planning Manufacture Customers Finance Quality HR Supply Chain Management Business Project • IT is an enabler Data Matured? • Ready for Extract/Transform/Load (ETL) Right Software for Right Requirements • Different companies, different formula Business Process Upgrade • Don’t implement bad processes Training • Tools and Process Team - authority & understanding • Budget, Resource, Technology • Processes, Data, Requirements Going Live is not the end game!
  42. 42. MRP/ERP - Real Time Business • End to End records • Critical for some markets • Single point of truth • Portfolio not Project view • Accelerate intelligence • 2 weeks to 1 minute in SC • 15 minutes to 15 seconds production status • Materials/Labour variability • More managed data recorded, reduced effort • Order to customer receipt <18 hours • Partner Integration • Port Merger Integrations Agility Accuracy Traceability Informed
  43. 43. The traditional space industry Copyright © 2017 Clyde Space Ltd. All rights reserved.
  44. 44. The future of nanosatellite technology
  45. 45. Quality innovation
  47. 47. SmallSat market Clyde Space Commercial in Confidence
  48. 48. Where do satellites fit in? Satellites are used to provide the main network connectivity for a localised area which in turn relies on localised network (PAN or LAN) to distribute this connectivity between devices within this area.
  49. 49. IoT from Space
  50. 50. Earth observation applications SeaHawk ▪ Pair of 3U CubeSats flying multispectral imagers ▪ x30 smaller than existing SeaWIFS asset ▪ Putting Moore’s Law into space ▪ Revisit rate for IOD of 7 days ▪ Downlinked via X-band to NASA NEN
  51. 51. Earth observation applications PICASSO ▪ ESA and BISA mission ▪ Hyperspectral imager supplied by VTT ▪ Ozone monitoring mission ▪ Single 3U IOD
  52. 52. IPP Mission: FireSat
  53. 53. Quantum applications
  54. 54. Thank You
  55. 55. Launch and deployment
  56. 56. Clyde Space nanosatellite solutions
  57. 57. Launch and deployment
  58. 58. Ground control Clyde Space ground station solution: ▪ Based at our headquarters in Glasgow, Scotland ▪ Further sites established in ME and US ▪ SDR based architecture ▪ Fully automated ▪ Working to develop partnerships with recognised and emerging ground station solutions globally
  59. 59. SmallSat excellence
  60. 60. Clyde Space Cubesats
  61. 61. Meeting and generating demand
  62. 62. enquiries@clyde.space
  63. 63. Questions & Discussion #scotdata
  64. 64. Refreshments & Networking Please check rear of badges for breakouts #scotdata
  65. 65. How to make an impact with your data: Going from Boring to Beautiful Louis Archer Tableau @louisarcher
  66. 66. TECHNOLOGY IS NOT THE PROBLEM
  67. 67. BUSINESS CULTURE IS THE PROBLEM
  68. 68. Collaboration Iteration Training
  69. 69. “Data visualisation is a language. It’s a means to convey an opinion, an argument.” Kim Rees – Founding Partner, Periscopic
  70. 70. Iraq: Deaths on the decline
  71. 71. Iraq: Deaths on the decline
  72. 72. Which do you prefer?
  73. 73. Design by Tableau; inspired by CTT Wireless Dashboard from
  74. 74. By Mike Cisneros
  75. 75. Functional Beautiful
  76. 76. Pleasurable experiences: The three levels of processing
  77. 77. Pleasurable experiences: The three levels of processing Visceral Behavioural Reflective
  78. 78. Visceral
  79. 79. Don Norman’s Pleasurable experiences: The three levels of processing Visceral Behavioural Reflective
  80. 80. Behavioural
  81. 81. Behavioural: Chart Choice
  82. 82. Behavioural:
  83. 83. Tableau Research: Eye-tracking
  84. 84. http://tabsoft.co/designmonth #VisualDesignTricks @acotgreave
  85. 85. Don Norman’s Pleasurable experiences: The three levels of processing Visceral Behavioural Reflective
  86. 86. How to make an impact with your data? The three levels of processing: Visceral Behavioural Reflective
  87. 87. Collaboration Iteration Training
  88. 88. However….
  89. 89. DASHBOARDS ARE THE PROBLEM
  90. 90. “Impact” is NOT just about beautiful/functional dashboards
  91. 91. “The interesting thing was, we thought we were doing well, and then we discovered there was this big negative cost. It was like, ‘Oh my God.’ Suddenly you go and say, ‘Okay, I’ve discovered a new aspect of engine cost that we hadn’t realized.’ Suddenly you’re going, ‘Bang, bang, bang, two minutes in Tableau’ and you can see the average per month, the average per day, and it’s like, ‘Oh, wow, we can do this slightly differently.’ Within two days, I’d literally re-worked the whole instruction, sent it out to people, and off we went. As a result, it’s been a very significant difference in terms of U.S. dollars.” Jonathan Capper Production Planning Manager
  92. 92. “The interesting thing was, we thought we were doing well, and then we discovered there was this big negative cost. It was like, ‘Oh my God.’ Suddenly you go and say, ‘Okay, I’ve discovered a new aspect of engine cost that we hadn’t realized.’ Suddenly you’re going, ‘Bang, bang, bang, two minutes in Tableau’ and you can see the average per month, the average per day, and it’s like, ‘Oh, wow, we can do this slightly differently.’ Within two days, I’d literally re-worked the whole instruction, sent it out to people, and off we went. As a result, it’s been a very significant difference in terms of U.S. dollars.” Jonathan Capper Production Planning Manager
  93. 93. Bang, bang, bang, two minutes… Jonathan Capper Production Planning Manager
  94. 94. Why? Why? Why? Data visualisation Known unknowns Predefined answers only Visual analytics Unknown unknowns Instant answers to new questions Bang, bang, bang, two minutes…
  95. 95. We help people see and understand data
  96. 96. Welcome Back to Big Data Scotland 2017 #datascot
  97. 97. Matthieu Poyade The Glasgow School of Art @GSofASimVis #datascot
  98. 98. Data Visualisation Dr. Matthieu Poyade
  99. 99. Data Visualisation Graphical communication process which empowers one to gain understanding and insight into data. “Visualization offers a method for seeing the unseen” - McCormick (1987)
  100. 100. “People are generally better persuaded by the reasons which they have themselves discovered than by those which have come into the mind of others.” Pascal (1623 – 1662)
  101. 101. Pragmatic Approach to Visualisation • Data Communication – Effectively and efficiently make the user understand • Visual Efficiency – Perceptually efficient, make maximum use of the visual channels • Data is given – Visualization is not about generating data, although sometimes data requires pre-processing
  102. 102. Looking at History… London’s Cholera outbreak - Dr John Snow (1854)
  103. 103. Looking at history… The London Underground through time
  104. 104. Looking at History… The Glasgow Subway through time
  105. 105. We are drowning in a sea of data…The Flood of Data
  106. 106. A picture is worth a thousand words!!
  107. 107. How can Data Visualisation help to envision decision making impact Challenger Disaster January 28, 1986
  108. 108. • Engineers knew there was a problem: Failure of ring joint at low launching temperatures • Technical data presented through 13 Charts (notes, tables & diagrams) to management officers recommended not to launch • Data were presented poorly and didn’t enable the correlation between cooler temperatures and an increased chance of damage
  109. 109. Avoidable Tragedy? “Bad slides or bad engineering don’t kill people: bad decisions do” (Tufte)
  110. 110. How can VR/AR enhance Data Visualisation for Businesses?
  111. 111. AUGMENTED REALITY
  112. 112. AR and VR in Data Visualisation
  113. 113. Steven Faull Aggreko @stevenfaull #datascot
  114. 114. DECEMBER 17 Big Data Scotland 2017 Steven Faull Head of Software & Analytics @stevenfaull From Reactive to Proactive to Predictive: Using Data to Drive Customer Benefit
  116. 116. Reactive The way things used to be… • Service scheduling • Customer calls for support • Technicians have no visibility of faults • Spare parts? • Low customer service level
  117. 117. Proactive The way it is today… • Telemetry-enabled fleet • The ROC • Visibility of every alert and alarm globally • Highly proactive • Enhanced reliability & customer service
  118. 118. Innovation Examples… • Enterprise Social Networking • Geo-spatial Sales application • Cloud technology • Micro-services architecture on Service Fabric • Big Data Analytics
  119. 119. Applications
  120. 120. Predictive The future…using data to enable: • Near-time data analytics • Predictive alerting to the ROC • ‘Just-in-Time’ servicing • Zero breakdowns • ‘Best-in-Class’ customer service
  121. 121. Fire Prevention
  122. 122. High Temperature
  123. 123. Approach
  124. 124. Approach 1. Build a great team
  125. 125. Approach 1. Build a great team 2. Empower for innovation and exploration
  126. 126. Approach 1. Build a great team 2. Empower for innovation and exploration 3. Find a great sponsor
  127. 127. Approach 1. Build a great team 2. Empower for innovation and exploration 3. Find a great sponsor 4. Sell the vision through stories
  128. 128. Approach 1. Build a great team 2. Empower for innovation and exploration 3. Find a great sponsor 4. Sell the vision through stories 5. Collaborate, collaborate, collaborate
  129. 129. Approach 1. Build a great team 2. Empower for innovation and exploration 3. Find a great sponsor 4. Sell the vision through stories 5. Collaborate, collaborate, collaborate 6. Agile delivery
  130. 130. What’s Next? • Reliability • Market Intelligence • Service Intervals • Condition Based Service • Product Development • Procurement • Audit • Sales Effectiveness
  131. 131. Thank You Steven Faull Head of Software & Analytics @stevenfaull
  132. 132. Sarah Forbes Peterson @srf1980 #datascot
  133. 133. GUIDING PRINCIPLE 1 Full supply chain visibility for everyone GUIDING PRINCIPLE 2 Predict the future state based on data GUIDING PRINCIPLE 3 Enable ‘next best step’ decision-making or management by exception to flourish GUIDING PRINCIPLE 4 Enforcing a single source of truth encourages collaboration across the entire supply chain GUIDING PRINCIPLE 5 Provide a real-time trigger for smarter communication
  134. 134. Scott Krueger Skyscanner @skrueg #datascot
  135. 135. From Producers To Consumers Data Engineering @ Skyscanner Scott Krueger Principal Data Engineer
  136. 136. EVOLUTION
  137. 137. Brief history of travel
  138. 138. Skyscanner growth: interesting challenges
  139. 139. But wait - we have been designing during growth
  140. 140. Designing for growth
  141. 141. Producers and Consumers
  142. 142. Unified Flow - Data Platform
  143. 143. CHALLENGES
  144. 144. Perfection is in the eye of the data holder Data systems old and new have constraints. It is a fact of life. Accept it, find them, and change things for what you need.
  145. 145. Scale – often the first business driver?
  146. 146. Data Quality
  147. 147. Organise your data organisation Conway's law (1967): Technical system determines org structure Process Ingest Unified Log Transport Archive Science / Report / Analytics
  148. 148. (Re)Organise for growth Process Ingest Unified Log Transport Archive Science / Report / Analytics
  149. 149. The Data Organisation As a Business Strategy Educate To Empower The problem with Data Engineers and Data Scientists is that they’re Data Engineers and Data Scientists.... (and no one else is) "I'm not going to find that out for you - you are"
  150. 150. Operational Strategy Total Ownership "You build it, you run it." - Werner Vogels "This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service."
  151. 151. Product Evolution - Data Assets into Meaningful Information Operational Diagnostics Your organisation and strategy Business KPIs Financial Reporting Just what do I do with 4 TB of Data Events Per Day? Customer Products Partners and Suppliers Your customers (t)
  152. 152. The Future * Barrier to entry gets lower and lower * Data literacy and engineering up-skilling continues * SQL is back * Machine Learning / Auto-code generation * 5G
  153. 153. Wrap Up Unifying our data into a single platform… …frees up energy …to be invested into meaningful data products …that allow us to innovate faster …to better serve the world's travellers
  154. 154. ?
  155. 155. Questions & Discussion #datascot
  156. 156. Drinks & Networking #datascot
  157. 157. TAKING DATA SEARCH, DISCOVERY AND ANALYSIS TO THE NEXT LEVEL DAVID RIVETT CHIEF OPERATING OFFICER
  158. 158. Agenda • Unstructured Data • Information worker challenges • The power of precision search • When you don’t know where to start • Data Science & AI • Feral Data & GDPR
  159. 159. “Unstructured” Big Data In every organisation data is “relatively” Big • Organisational memory • Staff turnover • Personal and shared drives • Email • Document Management • Business Operational Systems
  160. 160. Today’s challenges… According to a McKinsey report, “employees spend 1.8 hours every day - 9.3 hours per week, on average - searching and gathering information. Put another way, businesses hire 5 employees but only 4 show up to work; the fifth is off searching for answers, but not contributing any value.” Source: Time Searching for Information.
  161. 161. Turning Data into Knowledge WHOLE DOCUMENTS A SINGLE SENTENCE CONTENT BY PARAGRAPH DOCUMENTS BY SECTION
  162. 162. The Chilcot Report • 58 separate PDF files.
  163. 163. SPARK VS FLINK STEFAN PAPP, VDSG BERNHARD ORTNER, THINK BIG ANALYTICS
  164. 164. Stefan Papp Sees himself as a Data Evangelist who has focused on data since 2010. He is passionate about how data can transform how we live and work and keen to explore all data with an open mindset and in an agile manner. He is a member of the Vienna Data Science Group, has worked across all industries and consults various companies on data strategies. He also works in projects as a data architect. Bernhard Ortner He is a Senior Data Engineer at Think Big Analytics, A Teradata Company. He has solid experience implementing several big data technologies across industries like telecommunications, finance, energy and government. In his experience, he has covered the whole spectrum in big data including visualization, data ingestion, fusion and integration.
  165. 165. TERADATA PORTFOLIO > High- impact business outcomes Analytic Solutions Analytics Business Consulting Business Value Framework Data Science Business Solutions Strategy and Roadmaps Design and Implementation Ecosystem Architecture Managed Services Architecture Expertise Public Cloud, Private Cloud, Managed Cloud, Hybrid Cloud, On-Premises Teradata Database, Teradata Aster® Analytics, Hadoop Teradata QueryGrid™, Presto, Listener™, Unity, AppCenter Technology Solutions • ~1,400 + Customers in 77 Countries • ~10,000 Employees including ~5,000 Consultants • Market Cap: U.S. $4 Billion+ • World’s Most Ethical Companies – Ethisphere Institute • Fortune: Top 10 U.S. Software Company • The leader in Gartner Magic Quadrant Data Warehouse and Data Management Solutions for Analytics • The Forrester Wave™: Big Data Hadoop-Optimized Systems In-Memory Database Platforms Teradata at-a-glance © 2017 Teradata
  166. 166. VIENNA DATA SCIENCE GROUP Mission • Nonprofit association which aims to promote knowledge about data science methods/big data/AI techniques. Diverse members • Academics, professionals, students and all other Data Science enthusiasts Different fields • Mathematics, physics, econometrics, electrical engineering, medical science, finance, real estate, computer science and social sciences
  167. 167. Traditional Data: • ERPs, CRM Databases • Highly structured “New Data”: • Human or Machine Generated • “unstructured” Schema on Write Schema on Read [Chart: data volume in zettabytes, 2006 to 2020]
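
The two labels on this slide, schema on write and schema on read, can be illustrated with a minimal Python sketch. It assumes a hypothetical two-field schema (`id`, `amount`) and made-up helper names (`write_with_schema`, `read_with_schema`); none of these come from the talk, and real systems enforce schemas in far richer ways.

```python
import json
from typing import Dict, List

SCHEMA = {"id": int, "amount": float}  # hypothetical schema, for illustration only


def write_with_schema(table: List[Dict], record: Dict) -> None:
    """Schema on write: reject records that do not fit before storing them."""
    for field, kind in SCHEMA.items():
        if not isinstance(record.get(field), kind):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)


def read_with_schema(raw_lines: List[str]) -> List[Dict]:
    """Schema on read: store anything, interpret it only when querying."""
    rows = []
    for line in raw_lines:
        data = json.loads(line)
        try:
            rows.append({field: kind(data[field]) for field, kind in SCHEMA.items()})
        except (KeyError, TypeError, ValueError):
            continue  # records that do not fit the schema are skipped at read time
    return rows


table: List[Dict] = []
write_with_schema(table, {"id": 1, "amount": 9.5})

# The raw store accepts any JSON line, including fields the schema never mentions.
raw_store = ['{"id": "2", "amount": "3.25", "note": "free text"}', '{"comment": "no id"}']
rows = read_with_schema(raw_store)
print(table, rows)
```

The design trade-off is where the cost lands: schema on write pays for validation at ingest and keeps queries simple, while schema on read accepts everything cheaply and pays at query time.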
  168. 168. SPARK – “THE SMART PHONE FOR BIG DATA” What do 4 Gen. Processing Engines have in common with the iPhone?
  170. 170. In summary
  171. 171. Administration • Power BI uses Azure Active Directory to authenticate users who log in to the Power BI service User Friendly / Self-Serve • Uses a graphical user interface built around self-serve Security & Privacy • Currently complies with both EU-U.S. Privacy Shield and EU Model Clauses. • Committed to GDPR compliance Cross Platform Functionality & Embedding • Cloud and desktop versions • iOS and Android apps available • Can embed Power BI code within websites Sources & Connections • Wide list of out of the box connectors including Excel, SQL, MySQL, and widely used APIs (e.g. Google Analytics) Managing, Cleaning & Transforming • Self-serve ETL process • Flexible to perform more advanced queries, e.g. merging and cleaning files Storing & Querying • Does not automatically store raw data; however, there is an option to save data within the report • Can refresh data from source automatically Reports • Allows users to build visual reports • Can export the reports to PowerPoint Dashboards • Elements from reports can be reproduced within dashboards • Best viewed on devices Visualisation • Out of the box visuals provided • Online library of custom visuals • Ability to create own visuals using R Analytics • Functionality similar to Excel • Has natural language processing built in Collaboration • Able to share with selected people or open access • Creates an associated conversation stream
  172. 172. Explain Enlighten Engage
  173. 173. Good Data Sufficient Observations Representative Timely
  174. 174. ANSWER (Before Spark | After Spark | After Flink) Program-API: Map Reduce | Spark Core | Flink API; Abstraction: Pig, Cascading | Spark Core | Flink API; Streaming: Apache Storm | Spark Streaming | Flink API; Machine Learning: Mahout | Spark ML | FlinkML; Graph Engine: Giraph | Spark GraphFrames | Gelly; SQL: Hive | Spark SQL | TableAPI
  175. 175. SPARK ARCHITECTURE • APIs: Scala, Java, Python or R • Core Modules Custom Modules H2O
  176. 176. DISTRIBUTED DATA SETS • Data Containers to keep data in-memory • RDDs: untyped • DataFrames / DataSets: typed / object-oriented • Processing with data sharing instead of shared nothing • complex, multi-pass analytics (e.g. ML, graph) • interactive ad-hoc queries • real-time stream processing Query Input Query Query
  177. 177. STREAMING IN DETAIL • Data is micro-batched, i.e. divided into n RDDs • For each RDD a task is launched and destroyed once it’s completed • A task is a function you apply, e.g. filter, map, …
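
The micro-batching idea on this slide can be sketched in plain Python, without Spark. The names `micro_batches`, `run_task` and `keep_even` are illustrative assumptions, not Spark Streaming API. The sketch cuts batches by event count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a (possibly unbounded) event stream into small lists."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_task(batch: List[int], task: Callable[[List[int]], List[int]]) -> List[int]:
    """Run one short-lived task for one batch; it ends when the batch is done."""
    return task(batch)


def keep_even(batch: List[int]) -> List[int]:
    """Example task: a filter applied to every element of the batch."""
    return [x for x in batch if x % 2 == 0]


stream = range(1, 11)  # stand-in for an incoming event stream
results = [run_task(b, keep_even) for b in micro_batches(stream, batch_size=4)]
print(results)
```

Each batch is processed independently, which is why latency in this model is bounded below by the batch size or interval.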
  178. 178. ANALYTIC • MLlib: out-of-the-box ML library • New algorithms are added continuously • Sparkling Water: port of H2O for Spark; adds additional DS algorithms • H2O: ML/AI library for more advanced analytics • Deep Learning • Generalized Linear Model • PCA • …
  179. 179. SPARK - PIPELINES • Concept: sequence of algorithms to process and learn from data • A pipeline consists of • Transformers: transform a DataFrame into another one, e.g. ETL steps • Estimators: train a model on a DataFrame; the output is the trained model
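
The transformer/estimator split can be sketched in plain Python to show the shape of the concept. All class names below (`Scale`, `MeanThreshold`, `MeanThresholdModel`, `Pipeline`) are hypothetical stand-ins invented for this sketch and are not the Spark ML classes; a list of dicts stands in for a DataFrame.

```python
from typing import Dict, List

Row = Dict[str, float]


class Scale:
    """Transformer: maps a table to a new table (here it rescales one column)."""

    def __init__(self, column: str, factor: float) -> None:
        self.column, self.factor = column, factor

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.column: r[self.column] * self.factor} for r in rows]


class MeanThresholdModel:
    """Trained model: itself a transformer that appends a prediction column."""

    def __init__(self, column: str, threshold: float) -> None:
        self.column, self.threshold = column, threshold

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, "prediction": float(r[self.column] > self.threshold)} for r in rows]


class MeanThreshold:
    """Estimator: learns a threshold from the data and returns a model."""

    def __init__(self, column: str) -> None:
        self.column = column

    def fit(self, rows: List[Row]) -> MeanThresholdModel:
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanThresholdModel(self.column, mean)


class Pipeline:
    """Runs stages in order: estimators are fitted, transformers are applied."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> list:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a trained model
            rows = stage.transform(rows)
            fitted.append(stage)
        return fitted


def apply_stages(fitted: list, rows: List[Row]) -> List[Row]:
    """Apply every fitted stage, in order, to new data."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


training = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
fitted = Pipeline([Scale("x", 10.0), MeanThreshold("x")]).fit(training)
predictions = apply_stages(fitted, [{"x": 0.5}, {"x": 4.0}])
print(predictions)
```

The point of the pattern is that the same ordered stages run at training time and at prediction time, so preprocessing cannot drift between the two.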
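  180. 180. SPARK PROGRAM • Step I: create a distributed data set • Step II: apply filter, models,… • Step III: get result
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      // Step I: load the training data as a DataFrame (connection options are elided on the slide)
      val training = spark.read.format("jdbc").option("…").load().toDF("col1", "col2")
      // Step II: configure the estimator and wrap it in a pipeline
      val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
      val pipeline = new Pipeline().setStages(Array(lr))
      val model = pipeline.fit(training)
      pipeline.write.overwrite().save("name")
      // Step III: apply the fitted model and print the selected columns
      model.transform(testData).select("col1", "col2").collect().foreach(print…)
      https://spark.apache.org/docs/2.2.0/ml-pipeline.html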
  181. 181. WHAT HAPPENS BEHIND THE SCENES
  182. 182. • There is also a different approach…. ... a micro-batch is a collection of streaming events… But you can also use pure streaming
  183. 183. SURVIVAL OF THE FASTEST… [Chart: throughput in msgs/sec for Storm, Flink and Flink (10 GigE); 10 GigE end-to-end: 15m msgs/sec]
  184. 184. STREAMING DATA EVERYWHERE
  185. 185. Streaming: continuous processing on data that is continuously produced Sources Message Broker Stream processor collect publish/subscribe analyze Serve & store
  186. 186. CONNECTORS AND ADAPTERS DataBases … TwitterFeeds SensorData DataPlatforms Feeds Applications …
  187. 187. FLINK ECOSYSTEM
  188. 188. SOURCE -> PROCESS -> SINK Source Transformation Transformation Sink
      val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))
      val events: DataStream[Event] = lines.map((line) => parse(line))
      // window the parsed events (the slide referenced an undefined `stream` here)
      val stats: DataStream[Statistic] = events
        .keyBy("sensor")
        .timeWindow(Time.seconds(5))
        .apply(new MyAggregationFunction())
      stats.addSink(new RollingSink(path))
      Streaming Dataflow: Source [1], map() [1], keyBy()/window()/apply() [1], Sink [1], Source [2], map() [2], keyBy()/window()/apply() [2]
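
The keyBy / timeWindow / apply chain on this slide can be illustrated without Flink. The sketch below is plain Python under simplifying assumptions: it processes a finite list rather than a live stream, ignores watermarks and late events, and uses a mean as a stand-in for the slide's unspecified aggregation function. The names `Event` and `tumbling_window_mean` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    sensor: str
    timestamp: float  # event time in seconds
    value: float


def tumbling_window_mean(events: List[Event], size_s: int = 5) -> Dict[Tuple[str, int], float]:
    """Average `value` per sensor and per non-overlapping window of `size_s` seconds."""
    buckets: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for e in events:
        window_start = int(e.timestamp // size_s) * size_s  # window the event falls into
        buckets[(e.sensor, window_start)].append(e.value)   # group by key and window
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


events = [
    Event("a", 0.5, 10.0),
    Event("a", 4.9, 20.0),
    Event("b", 1.0, 7.0),
    Event("a", 5.1, 30.0),  # falls into the next 5-second window
]
stats = tumbling_window_mean(events)
print(stats)
```

Because windows are assigned from the event's own timestamp rather than its arrival order, shuffling the input list gives the same result, which is the property that makes event-time processing robust to out-of-order data.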
  189. 189. Streaming Source Streaming Source Streaming Source Consumer Forward events immediately to pub/sub bus Stream Processor Process at event time & update serving layer Message Broker Low latency High throughput Windowing / Out of order events State handling Fault tolerance and correctness
  190. 190. Huge Database Application Application Client Server NoSQL Application Application Micro Services Application Application Continuous Streaming Application Application RDBMS SECOND USE CASE: CONTINUOUS STREAMING
  191. 191. DO YOU HAVE CASES THAT REQUIRE PROCESSING MORE THAN ONE MILLION MESSAGES (PER SECOND)? Fraud Prevention CEP – Complex Event Processing Spam Prevention Network anomaly detection Predictive Maintenance Deep Packet Inspection File Processing CRM Import BATCH STREAM ERP Data Social Media Internet of Everything Image Recognition
  192. 192. DO YOUR OPERATIONAL SYSTEMS DELIVER STREAMING DATA?
  193. 193. REAL TIME ANALYTICS • Anomaly Detection: detect outliers (= observations that deviate markedly from the others) • Distance or Density based => clustering • Spark is used to train the models, but it’s too slow for production • Flink natively supports streams, but has fewer algorithms than Spark
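
A distance-based outlier check of the kind named on this slide can be sketched in a few lines of Python. This is a simple neighbour-count rule chosen for illustration, not the method used by the speakers; the function name, the radius and the neighbour threshold are all assumptions.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def distance_outliers(points: Sequence[Point], radius: float, min_neighbours: int) -> List[Point]:
    """Flag points with fewer than `min_neighbours` other points within `radius`."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(
            1 for j, q in enumerate(points) if i != j and dist(p, q) <= radius
        )
        if neighbours < min_neighbours:
            outliers.append(p)
    return outliers


# Four readings close together and one far away (made-up sample data).
readings = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]
outliers = distance_outliers(readings, radius=1.0, min_neighbours=2)
print(outliers)
```

The pairwise comparison is quadratic in the number of points, so this form suits small batches; larger data sets typically need spatial indexing or clustering first.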
  194. 194. SPARK & FLINK AT A GLANCE Bernhard Ortner, Sr. Data Engineer bernhard.ortner@thinkbiganalytics.com Mobile +43 664 Contact: Stefan Papp, Data Architect stefan.papp@icloud.com Mobile: +43 699 10209453
  195. 195. Precision Search - The Internet Experience
  196. 196. Precision Search – The Desktop Experience
  197. 197. Precision Search - The ultimate experience
  198. 198. Use Cases • Research • Detailed Analysis & Review • Litigation • Mergers & Acquisitions • Investigative Journalism • Better informed decisions • Faster more confident results and process
  199. 199. Where to start? “You don’t know what you don’t know!” “To really get answers out of your data you need the right questions!”
  200. 200. Data Science & AI • Topic Clustering and Categorisation • Semantic Enhancement • Dictionary Enhancement • Pattern Matching • Sentiment • Inherent Structure
  201. 201. Abuse 60% Neglect 25% Dysfunctional Family 10% Disability 5% SAMPLE PREDICTION Train Model using real case documents Model learns topics based on training data Model can now identify topics in case documents Model then predicts the probability of topics present in document Topic Clustering
  202. 202. Visualization of Topic Model
  203. 203. Surfacing Knowledge - Enhancement
  204. 204. Feral Data • What does this mean for GDPR?
  205. 205. Feral Data • Pattern matching PII • Bank, Credit card numbers • Passport numbers • Driving License details • Address, postcode, email • …
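
Pattern matching for PII, as listed on this slide, can be sketched with regular expressions. The patterns below are deliberately simplified assumptions (one email shape, one UK-style postcode shape, 13 to 19 digit card-like runs) and the sample text is made up; this is not the vendor's approach, and production detection needs much stricter rules. A Luhn checksum is used to discard digit runs that cannot be card numbers.

```python
import re
from typing import Dict, List

# Simplified, illustrative patterns only.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum over the digits of `number`, ignoring separators."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_pii(text: str) -> Dict[str, List[str]]:
    """Return every pattern hit, keeping only card candidates that pass Luhn."""
    hits = {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
    hits["card_number"] = [n.strip() for n in hits["card_number"] if luhn_valid(n)]
    return hits


sample = (
    "Contact jane.doe@example.com, G1 1AA, card 4111 1111 1111 1111, "
    "ref 1234 5678 9012 3456."
)
found = find_pii(sample)
print(found)
```

Pairing a loose pattern with a checksum keeps recall high while cutting false positives: the reference number in the sample has the right shape but fails the Luhn check, so it is not reported.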
  206. 206. GDPR – Dutch Example
  207. 207. Thank You! More questions? Please visit us at our stand Take the Nalytics challenge!
