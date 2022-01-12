
Jan. 12, 2022
Predicting Startup Market Trends based on the news and social media - Albert Lewandowski, GetInData


Nowadays, a single tweet can affect the value of a company or a cryptocurrency. Companies therefore need to know what is happening in the market, especially startups and companies entering a new market. This presentation describes a platform used to create and verify the strategy of a startup in the wellbeing market. We go through web-scraping-based data ingestion into ElasticSearch, NLP pipelines that extract what people are writing about, and a PySpark job that predicts the possible future of each market.
Author: Albert Lewandowski
LinkedIn: https://www.linkedin.com/in/albert-lewandowski/
___
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one, our focus has been on Big Data projects. We bring together a group of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have extensive production experience in implementing Big Data projects for Polish and foreign companies, including Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com


  1. 1. Predicting Startup Market Trends based on the news and social media. Author: Albert Lewandowski
  2. 2. © Copyright. All rights reserved. Not to be reproduced without prior written consent. About me ● Big Data DevOps Engineer - GetInData ● Focused on infrastructure, cloud, Big Data, AI, scalable web applications ● Certified Google Cloud Architect ● Certified Kubernetes Administrator
  3. 3. Content ● Business Use Case. ● Main challenges. ● Gathering data. ● Processing data. ● Business War Gaming. ● Quick start on your computer.
  4. 4. Business Use Case
  5. 5. Predict Startup Trends. Idea: Startups take advantage of buzzwords in each market, so automated market research is valuable for finding the best market and trend fit for a startup. Problem: Researching a new market is time-consuming, and it gets harder as more news appears every minute.
  6. 6. Predict Startup Trends. Solution (?): What if we gather data from the most popular sites and social media to get insights and check the trends? The gathered and preprocessed data can be used to predict trends and to check whether there are any direct competitors.
  7. 7. The Startup. Sectors: mobility, environment. Its product: a platform that uses IoT devices to measure noise pollution in cities and industrial areas. Current status: MVP. Next steps: align the strategy with market trends. And here we come to some simple pipelines :)
  8. 8. Main challenges
  9. 9. ~3-4 weeks for the project ● Which tools deliver results the fastest? ● What is crucial to meet the requirements? ● How can we measure trends based on the news? ● What data do we need to create valuable insights? ● Can we predict anything here?
  10. 10. Gathering data
  11. 11. News sites: All Startup News. Some sites already block overly frequent scraping. How do we detect changes on a site? Batch or real-time? Which sites are verified?
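
One common way to answer the "detect changes on the site" question without re-processing every page is to store a fingerprint of each page's text and compare it on the next visit. The sketch below is an illustration only, not the method used in the talk; the function names `fingerprint` and `has_changed` and the crude tag stripping are assumptions made for the example.

```python
import hashlib
import re
from typing import Dict


def fingerprint(html: str) -> str:
    """Return a stable hash of the page text, ignoring tags, case and whitespace runs."""
    text = re.sub(r"<[^>]+>", " ", html)            # crude tag stripping, enough for a sketch
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(url: str, html: str, seen: Dict[str, str]) -> bool:
    """Store the latest fingerprint for url and report whether it differs from the previous one."""
    new_fingerprint = fingerprint(html)
    changed = seen.get(url) != new_fingerprint
    seen[url] = new_fingerprint
    return changed
```

A scraper that calls `has_changed` before handing a page to the NLP step skips pages whose visible text is unchanged, which also reduces how often a site needs to be fetched in full.
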
  12. 12. How to get data? API clients. Data scrapers. Multiple packages. How can we manage workers?
  13. 13. Diagram
  15. 15. Understanding language
  16. 16. Target output ● Frequency of the phrase (such as the problem the startup tries to solve). ● Feelings related to it, and whether the problem is only mentioned or is the main topic. ● Each article or tweet is tagged with categories for: type of content, feelings, keywords. ● A separate analysis process for monitoring competitors.
  17. 17. Complex Analysis: Spark seems to be the right solution, but speed of development was more important than creating a scalable solution. Processing the Polish language is really tough and requires much more code development. Spark NLP v3 from John Snow Labs is worth checking.
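
The target output on this slide can be illustrated with a very small tagger. This is a sketch under strong assumptions: the word lists `NEGATIVE` and `POSITIVE` are hypothetical English stand-ins for a real sentiment model, and "main topic" is approximated by checking whether the phrase appears in the first sentence. The talk does not say how these were actually computed.

```python
import re
from typing import Dict, List

# Hypothetical mini-lexicons; a real pipeline would use a trained sentiment model.
NEGATIVE = {"problem", "harmful", "complaint", "noisy"}
POSITIVE = {"solution", "improve", "quiet", "funding"}


def tag_document(text: str, phrases: List[str]) -> Dict[str, object]:
    """Tag one article or tweet with phrase counts, a coarse feeling and main topics."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)

    # Frequency of each monitored phrase, matched on word boundaries.
    phrase_counts = {
        phrase: len(re.findall(r"\b" + re.escape(phrase.lower()) + r"\b", lowered))
        for phrase in phrases
    }

    # Coarse feeling: positive words minus negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    feeling = "positive" if score > 0 else "negative" if score < 0 else "neutral"

    # Assumption: a phrase in the first sentence is the main topic, not a passing mention.
    first_sentence = re.split(r"[.!?]", lowered, maxsplit=1)[0]
    main_topics = [p for p in phrases if p.lower() in first_sentence]

    return {"phrase_counts": phrase_counts, "feeling": feeling, "main_topics": main_topics}
```
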
  18. 18. Process data continuously
  19. 19. One scheduler to rule them all: Airflow is easy to install and set up, especially on Kubernetes. DAGs are a great way to schedule all pipelines and monitor whether they succeed.
  20. 20. Simple or advanced? It’s worth starting simple: Python is a mature solution in the NLP sector. Use Kubernetes if you know it, even a bit: you can simply install all required components there and take advantage of the docs and blogs about open-source solutions.
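
To make the DAG idea concrete without requiring an Airflow installation, the sketch below orders a set of pipeline steps with Python's standard-library `graphlib`. It is not Airflow code, and the task names are hypothetical, loosely based on the stages mentioned in the talk description; a scheduler such as Airflow adds scheduling, retries and monitoring on top of this kind of dependency ordering.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each task maps to the set of tasks it depends on (hypothetical task names).
PIPELINE: Dict[str, Set[str]] = {
    "scrape_sites": set(),
    "call_apis": set(),
    "run_nlp": {"scrape_sites", "call_apis"},
    "index_to_elasticsearch": {"run_nlp"},
    "predict_trends": {"index_to_elasticsearch"},
}


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())
```

If the dependencies contain a cycle, `static_order` raises `graphlib.CycleError`, which is why the structure has to be acyclic.
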
  21. 21. Use what you know in the beginning ● ElasticSearch is the central storage for all data. ● A PostgreSQL database stores metadata: information about sites and which articles have already been processed.
  22. 22. Next steps ● Improve retry policies and add a queueing system (Cloud Pub/Sub) for managing jobs. ● Add workers dynamically to each pipeline. ● Add a frontend for managing target sites and the phrases we want to monitor.
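
The "which articles are already processed" metadata can be sketched as a single table keyed by URL. To keep the example self-contained it uses SQLite from the standard library as a stand-in for PostgreSQL; the table and column names are assumptions, not the schema from the talk.

```python
import sqlite3


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open the metadata store and create the table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_articles ("
        " url TEXT PRIMARY KEY,"
        " site TEXT NOT NULL,"
        " processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def mark_processed(conn: sqlite3.Connection, url: str, site: str) -> bool:
    """Record the article; return True if it is new, False if it was already processed."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO processed_articles (url, site) VALUES (?, ?)",
        (url, site),
    )
    conn.commit()
    return cursor.rowcount == 1
```

In PostgreSQL the same idea is written as `INSERT ... ON CONFLICT (url) DO NOTHING`. Because the insert is a no-op for known URLs, re-running a pipeline over the same articles does not create duplicates.
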
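
A retry policy of the kind listed under next steps is often implemented as exponential backoff. The helper below is a minimal sketch, not the project's implementation; the name `retry` and its parameters are assumptions, and the `sleep` argument exists only so the delays can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts - 1):
        try:
            return func()
        except Exception:
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... base_delay
    return func()  # final attempt: let any exception propagate to the caller
```

With the defaults, a job that keeps failing is tried four times with waits of 1, 2 and 4 seconds in between, and the last error is raised so the scheduler can mark the task as failed.
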
  23. 23. Perception Business logic CI/CD Idempotency Reprocessing Explainability Monitoring Testing Serving Infrastructure Data Ingestion Security
  24. 24. Reality Business logic CI/CD Idempotency Reprocessing Explainability Monitoring Testing Serving Infrastructure Data Ingestion Security
  25. 25. Monitoring of the efficiency
  26. 26. Observability
  27. 27. Observability: Monitoring is the process of gathering metrics about the IT environment and running applications, and observing system performance. Observability measures how well the internal states of a system can be inferred from knowledge of its external outputs (a definition from control theory).
  28. 28. Observability. Example: a data processing job written in Spark that rewrites data from location A to B. Gathering its metrics and setting up alerts or creating a dashboard with a simple runtime visualization are quite simple tasks. However, to achieve observability we should also collect metrics about the amount of processed data, JVM statistics and metrics about the underlying infrastructure.
  29. 29. Quick and simple setup: Prometheus (metrics), Loki with Promtail (log analytics).
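
Prometheus scrapes metrics in a plain-text exposition format, so a job can expose figures such as the amount of processed data with very little code. The sketch below renders a counter by hand to show what that format looks like; the metric name and the `pipeline` label are hypothetical, and in practice an official Prometheus client library would normally generate this output.

```python
from typing import Dict


def render_counter(
    name: str,
    help_text: str,
    samples: Dict[str, float],
    label: str = "pipeline",
) -> str:
    """Render one counter metric, with one labelled sample per entry, as Prometheus text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_value, value in sorted(samples.items()):
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"
```

For example, `render_counter("articles_processed_total", "Articles processed per pipeline.", {"news": 120})` yields a `# HELP` line, a `# TYPE` line and the sample `articles_processed_total{pipeline="news"} 120`.
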
  30. 30. What to monitor? Errors Quality and quantity Data scraping Self-managed Compute Resources Managed Compute Resources Performance of NLP pipelines Logs monitoring
  31. 31. Visualizing results
  32. 32. Kibana: Kibana is a powerful tool for visualizing language-related data, and even non-technical users can learn it easily. It is a great place to create a dashboard with regularly refreshed, tagged content.
  33. 33. Superset: An open-source app based on Flask App Builder and an interesting solution for creating dashboards and sharing them with all stakeholders. Easy integration. Simple forking / updating of features. Supports multiple authentication layers.
  34. 34. Business War Gaming
  35. 35. What is Business War Gaming? “Business wargaming” is a role‐playing simulation of a dynamic business situation that involves a series of teams, each assigned to assume the identity of an entity with a stake in the situation. Data Experience Strategy
  36. 36. Multiple market factors ● Competitors ● Regulators ● Public sector ● Speed of development ● What would customers like to see, and what problems do they see?
  37. 37. What is a trend? ● More articles ● More people write about the problem ● There may be changes in the law that favor the solution ● Similar companies receive funding
  38. 38. 21st Century Features: Internet analysis provides valuable information in near real time and surfaces data that would be tough for a human to find. ● Quick trend detection ● Monitoring as many sources as we need ● Automated report creation for all players
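
The first two signals on this slide (more articles, more people writing about the problem) can be turned into a number by fitting a straight line to the counts per period and looking at its slope. This is a minimal sketch of one possible trend score, not the PySpark job from the talk; `trend_slope`, `is_trending` and the `min_slope` threshold are assumptions.

```python
from typing import Sequence


def trend_slope(counts: Sequence[float]) -> float:
    """Least-squares slope of counts against the period index 0..n-1."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two periods to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def is_trending(counts: Sequence[float], min_slope: float = 1.0) -> bool:
    """Flag a phrase whose mentions grow by at least min_slope per period."""
    return trend_slope(counts) >= min_slope
```

Weekly article counts of 2, 4, 6 and 8 give a slope of 2 mentions per week, so the phrase would be flagged with the default threshold, while a flat series would not.
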
  39. 39. Join Us! ● Data Engineer: Spark, Kafka, Airflow, public cloud ● Backend Engineer: Java / Scala, microservices ● MLOps Engineer: MLOps tools, Python, public cloud ● DevOps / SRE: GCP, Terraform, Prometheus
  40. 40. Thank you for your attention!
  41. 41. Q&A
  42. 42. Contact details. Email: albert.lewandowski@getindata.com LinkedIn: https://www.linkedin.com/in/albert-lewandowski

