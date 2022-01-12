
Data & Analytics
Jan. 12, 2022
NLP for videos: Understanding customers' feelings in videos - Albert Lewandowski, GetInData

More and more videos are created and distributed via multiple social media channels. It is becoming increasingly important for companies to monitor them in order to track their customers' feedback, reviews and opinions. During the talk, we cover extracting text from videos, analyzing language and preparing a robust, scalable infrastructure for it. The idea behind the platform is to mix managed and self-managed services for Big Data processing. The keynote presents a case study of the MVP of the platform for marketing companies.
Author: Albert Lewandowski
LinkedIn: https://www.linkedin.com/in/albert-lewandowski/
___
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com

  1. NLP for videos: Understanding customers' feelings in videos. Author: Albert Lewandowski
  2. © Copyright. All rights reserved. Not to be reproduced without prior written consent. About me ● Big Data DevOps Engineer - GetInData ● Focused on infrastructure, cloud, Big Data, AI, scalable web applications ● Certified Google Cloud Architect ● Certified Kubernetes Administrator
  3. Content ● Problem to solve ● Big Data Frameworks or not? ● Cloud Magic ● How to mix technologies? ● Observability ● Lessons learnt
  4. Problem to solve
  5. The Problem: A large volume of videos is hard to monitor, while more and more young users prefer to mention brands on video-based social media. ● 60% of companies don't convert leads into revenue ● Viewers retain 95% of a message when they watch it in a video, compared to 10% when reading it in text ● 54% of consumers want to see more video content from a brand or business they support. Source: Agility PR, Rick Whittinghton, Hubspot, Insivia
  6. Solution: A scalable cloud platform written in Golang, Python and React, using Azure Machine Learning services and Apache Spark. ● Artificial Intelligence ● Efficient
  7. ~5 - 6 weeks for the project ● Which tools are the fastest in delivering results? ● What is crucial to meet the requirements for the PoC? ● How can we analyze language? ● What data do we need to create valuable insights? ● Can we provide a scalable platform?
  8. Perception: Business logic, CI/CD, Idempotency, Reprocessing, Explainability, Monitoring, Testing, Serving, Infrastructure, Data Ingestion, Security
  9. Reality: Business logic, CI/CD, Idempotency, Reprocessing, Explainability, Monitoring, Testing, Serving, Infrastructure, Data Ingestion, Security
  10. Big Data Frameworks or not?
  11. Quick start: 1. The Linux command line is enough as the entry point for the project. 2. A Python script and managed services. 3. Do not reinvent the wheel.
  12. Big Data tools? 1. Apache Spark can be a flexible tool, especially when we already know it and want to test it at a bigger scale. 2. Writing our own app in Golang can be a wise choice for simple actions such as gathering data from external sources. 3. Limitations of the components, a. such as an external SDK
  14. Cloud Magic
  15. Complex Analysis: Spark seems to be the right tool for it, but the speed of development was more important than building a scalable solution. Processing the Polish language is really tough and requires much more custom code. Spark NLP v3 from John Snow Labs is worth checking.
  16. Target output ● Frequency of a phrase (for example, a problem with the product). ● Feelings related to it, and whether the problem is only mentioned or is the main topic of the video. ● Each video is tagged with categories corresponding to: type of content, feelings, keywords. ● Visualization of changes over time.
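The first and last items of the target output (phrase frequency and changes over time) can be illustrated with a minimal sketch. It is not the platform's code: the `Transcript` record, the function names and the sentence-share heuristic for "only mentioned" versus "main topic" are all assumptions made for illustration, and plain substring matching stands in for real NLP.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Transcript:
    # Hypothetical record: one transcribed video.
    video_id: str
    published: date
    text: str


def phrase_frequency_by_month(transcripts, phrase):
    """Count case-insensitive occurrences of `phrase`, grouped by YYYY-MM."""
    needle = phrase.lower()
    counts = Counter()
    for t in transcripts:
        hits = t.text.lower().count(needle)
        if hits:
            counts[t.published.strftime("%Y-%m")] += hits
    return dict(sorted(counts.items()))


def mention_role(text, phrase, main_topic_share=0.5):
    """Assumed heuristic: the phrase is the 'main topic' when it appears in at
    least `main_topic_share` of the sentences, otherwise it is only 'mentioned'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    needle = phrase.lower()
    hits = sum(needle in s.lower() for s in sentences)
    if hits == 0:
        return "absent"
    return "main topic" if hits / len(sentences) >= main_topic_share else "mentioned"


SAMPLE = [
    Transcript("v1", date(2021, 11, 3), "The battery drains fast. Battery drains overnight too."),
    Transcript("v2", date(2021, 11, 20), "Great camera, no complaints."),
    Transcript("v3", date(2021, 12, 5), "Sadly the battery drains quickly."),
]
```

Calling `phrase_frequency_by_month(SAMPLE, "battery drains")` gives per-month counts that a dashboard could plot. A production version would need lemmatization, which matters a lot for an inflected language such as Polish.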
  17. Managed Services: The public cloud provides a wide range of services, but their quality may differ. Moreover, pricing can be really high.
  18. List of steps: 1. Get the required links to videos. 2. Process each video to keep only the audio. 3. Save the audio to storage. 4. Get the audio and process it with Azure Cognitive Services to receive text. 5. Save the output to Elasticsearch. 6. Process the output with Spark to get emotions and feelings based on the text.
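Steps 1 to 5 above can be sketched as a small orchestration loop. This is an illustration under stated assumptions, not the project's implementation: the `Pipeline` class is hypothetical, the two callables stand in for the real audio extraction and the managed speech-to-text service, and plain dictionaries stand in for blob storage and the search index. Step 6 (sentiment analysis in Spark) is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Pipeline:
    # Both callables are injected so real services can be swapped for fakes.
    extract_audio: Callable[[str], bytes]    # step 2: video link -> audio bytes
    speech_to_text: Callable[[bytes], str]   # step 4: audio bytes -> transcript
    audio_store: Dict[str, bytes] = field(default_factory=dict)  # step 3 stand-in
    text_index: Dict[str, dict] = field(default_factory=dict)    # step 5 stand-in

    def run(self, video_links: List[str]) -> List[str]:
        """Process each link once and return the links handled in this run."""
        processed = []
        for link in video_links:               # step 1: links are the input
            if link in self.text_index:        # already indexed: skip on re-run
                continue
            self.audio_store[link] = self.extract_audio(link)
            text = self.speech_to_text(self.audio_store[link])
            self.text_index[link] = {"link": link, "text": text}
            processed.append(link)
        return processed


def fake_extract_audio(link: str) -> bytes:
    # Fake for local runs: pretends the link itself is the audio payload.
    return link.encode("utf-8")


def fake_speech_to_text(audio: bytes) -> str:
    # Fake for local runs: no real transcription happens here.
    return "transcript of " + audio.decode("utf-8")
```

Injecting the stages as callables keeps the loop testable without cloud credentials, and skipping links that are already indexed makes a re-run of the same batch harmless.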
  19. Video To Audio To Text ● Azure Cognitive Services ○ Works pretty well with many languages ○ Speech To Text ● Custom implementation ○ Requires a lot of time ○ Required for production use cases
  20. How to mix technologies?
  21. Microservices: Perfect Match ● We can easily divide the platform ● Data ingestion: there can be a large number of small pieces of data ● Data processing: no need for real time, batch processing in Spark works well ● A queue is a must
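The role of the queue between ingestion and batch processing can be shown with a minimal in-process sketch. Python's standard `queue.Queue` is used here only as a stand-in for a real message broker, and `drain_batch` is a hypothetical helper name.

```python
import queue


def drain_batch(q, max_items):
    """Take up to `max_items` messages without blocking; may return fewer."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break  # nothing left right now; process what we have
    return batch


def ingest(q, items):
    """Ingestion side: publish many small pieces of data, one message each."""
    for item in items:
        q.put(item)


# Ingestion and processing only share the queue, so each side can be
# deployed, scaled and restarted independently.
work_queue = queue.Queue()
ingest(work_queue, ["video-1", "video-2", "video-3", "video-4", "video-5"])
first_batch = drain_batch(work_queue, 3)
second_batch = drain_batch(work_queue, 3)
```

The batch consumer pulls whatever has accumulated, which matches the point above that real-time processing is not needed.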
  22. What about local setup? ● Docker Compose ○ Apps can be quickly containerized ● Cloud services ○ To mock or not to mock them? ● Remote developer instance ○ Ephemeral Kubernetes clusters might also be a good idea for your case
  23. Observability
  24. Observability
  25. Observability: Monitoring describes the process of gathering metrics about the IT environment and running applications, and observing the system performance. Observability is about measuring how well the internal states of a system can be inferred from knowledge of its external outputs (according to control theory).
  26. Observability example: a data processing job written in Spark that rewrites data from location A to B. Gathering its metrics and setting up alerts or creating a dashboard with a simple runtime visualization are quite simple tasks. However, to achieve observability we should also collect metrics about the amount of processed data, JVM statistics and some metrics about the underlying infrastructure.
  27. Quick and simple setup ● Prometheus: metrics ● Loki with Promtail: log analytics
  28. What about alerts? Alerts signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
  29. What to monitor? ● Errors ● Quality and quantity ● Data scraping ● Self-managed compute resources ● Managed compute resources ● Performance of NLP pipelines ● Logs monitoring
  30. Lessons learnt
  31. Keep It Simple ● For a PoC, go with the simplest possible solution. ● Cloud services are always worth checking. ● Mixing technologies is a good idea if we already have the know-how within the team
  32. Corner cases, corner cases ● Remember about corner cases ○ Processing a greater number of events ○ Possibility to scale environments up and down ○ Limitations or downtime of any external services ○ Data reprocessing ● CI/CD is always your friend ○ Unit and integration tests are a must-have
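One of the corner cases above, limitations or downtime of external services, is commonly handled with retries and exponential backoff. The sketch below is a generic illustration, not code from the project: `call_with_retry` is a hypothetical helper, `ConnectionError` stands in for whatever error the real client raises, and the `sleep` parameter exists so tests can avoid real waiting.

```python
import time


def call_with_retry(func, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on ConnectionError with exponential backoff.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last error once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


class FlakyService:
    """Test double: fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("service temporarily unavailable")
        return "ok"
```

A unit test can pass a list's `append` method as `sleep` to record the delays instead of waiting, which also fits the point above about tests being a must-have.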
  33. Q&A
  34. Contact details: albert.lewandowski@getindata.com LinkedIn: https://www.linkedin.com/in/albert-lewandowski
  35. Join Us! ● Data Engineer: Spark, Snowflake, Airflow, AWS ● Data Scientist: Python, SQL, Data Science ● MLOps Engineer: MLOps tools, Python, public cloud ● Data Engineer (GCP): GCP, Spark, BigQuery
  36. Thank you for your attention!

