Spark is used to perform in-memory transformations on customer data collected by Totango to generate analytics and insights. Luigi is used as a workflow engine to manage dependencies between batch processing tasks such as metrics generation, health scoring, and alerting. The tasks run on Spark and write their output to S3. A custom controller called Gameboy provides monitoring and management of the Luigi workflows.
Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe will talk about what features and qualities are important for a workflow system.
Building cloud-enabled genomics workflows with Luigi and Docker – Jacob Feala
Talk given at Bio-IT 2016, Cloud Computing track
Abstract:
As bioinformatics scientists, we tend to write custom tools for managing our workflows, even when viable, open-source alternatives are available from the tech community. Our field has, however, begun to adopt Docker containers to stabilize compute environments. In this talk, I will introduce Luigi, a workflow system built by engineers at Spotify to manage long-running big data processing jobs with complex dependencies. Focusing on a case study of next generation sequencing analysis in cancer genomics research, I will show how Luigi can connect simple, containerized applications into complex bioinformatics pipelines that can be easily integrated with compute, storage, and data warehousing on the cloud.
Video and slides synchronized, mp3 and slide download available at http://bit.ly/2nwSwEh.
Marco Bonzanini discusses the process of building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data; in general, all the steps necessary to prepare data for a data-driven product. In particular, he focuses on data plumbing and on the practice of going from prototype to production. Filmed at qconlondon.com.
Marco Bonzanini is a Data Scientist and co-organizer of the PyData London Meetup.
We will introduce Airflow, an Apache project for scheduling and workflow orchestration. We will discuss use cases, applicability and how best to use Airflow, mainly in the context of building data engineering pipelines. We have been running Airflow in production for about 2 years; we will also go over some learnings, best practices and some tools we have built around it.
Speakers: Robert Sanders, Shekhar Vemuri
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky... – Spark Summit
Data scientists write SQL queries every day. Very often they know how to write correct queries but don’t know why their queries are slow. This is more obvious in Spark than in Redshift, as Spark requires additional tuning such as caching while Redshift does the heavy lifting behind the scenes.
In this talk I will cover a few lessons we learned from migrating one of our biggest tables (900M+ rows/day) from AWS Redshift to Spark.
Specifically:
– Why and how do we migrate?
– How do we tune the query for Spark to gain a 10x speedup vs. a direct translation from Redshift
– How do we scale the team on Spark (with 80+ people in our data science team)
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, St... – InfluxData
Jorge de la Cruz [Veeam Software] | RESTful API – How to Consume, Extract, Store, and Visualize Data with InfluxDB and Grafana | InfluxDays Virtual Experience NA 2020
How I learned to time travel, or, data pipelining and scheduling with Airflow – Laura Lorenz
UPDATE: Project is now open sourced at https://www.github.com/industrydive/fileflow
From Pydata DC 2016
Description
Data warehousing and analytics projects can, like ours, start out small - and fragile. With an organically growing mess of scripts glued together and triggered by cron jobs hiding on different servers, we needed better plumbing. After perusing the data pipelining landscape, we landed on Airflow, an Apache-incubating batch processing, pipelining and scheduling tool from Airbnb.
Abstract
The power of any reporting tool rests on the data behind it, so when our data warehousing process got too big for its humble origins, we searched for something better. After testing out several options such as Drake, Pydoit, Luigi, AWS Data Pipeline, and Pinball, we landed on Airflow, an Apache-incubating batch processing, pipelining and scheduling tool originating from Airbnb that provides the benefits of pipeline construction as directed acyclic graphs (DAGs), along with a scheduler that can handle alerting, retries, callbacks and more to make your pipeline robust. This talk will discuss the value of DAG-based pipelines for data processing workflows, highlight useful features in all of the pipelining projects we tested, and dive into some of the specific challenges (like time travel) and successes (like time travel!) we’ve experienced using Airflow to productionize our data engineering tasks. By the end of this talk, you will learn
- pros and cons of several Python-based/Python-supporting data pipelining libraries
- the design paradigm behind Airflow, an Apache incubating data pipelining and scheduling service, and what it is good for
- some epic fails to avoid and some epic wins to emulate from our experience porting our data engineering tasks to a more robust system
- some quick-start tips for implementing Airflow at your organization.
Interactive learning analytics dashboards with ELK (Elasticsearch Logstash Ki... – Andrii Vozniuk
My workshop at the Learning Analytics Summer Institute (LASI) 2016: http://lasi16.snola.es/#!/schedule/113
Educational data continues to grow in volume, velocity and variety. Making sense of educational data under such conditions requires the deployment and use of appropriate scalable, real-time processing tools that support a flexible data schema. Elasticsearch is one of the popular open-source tools meeting these requirements. Initially envisioned as a search engine capable of operating at scale and in real time, Elasticsearch is used by organisations such as Wikimedia and GitHub, which deal with big data on a daily basis. In addition, Elasticsearch is increasingly used as an analytics platform thanks to its scalable architecture and expressive query language. Until recently, the exploitation of Elasticsearch for (learning) analytics purposes by practitioners was hindered by a high entrance barrier due to the complexity of the query language and its specificities. This is currently changing with the ongoing development of Kibana, an open-source tool that makes it possible to conduct analysis and build visualisations of Elasticsearch data through a graphical user interface. Kibana does not require the user to dive into the technical details of the queries (although it is still possible) and hence makes big educational data visualisations accessible to regular users. The additional value of Kibana comes into play whenever several visualisations are combined on a single dashboard, enabling the use of multiple coordinated views for interactive explorative analysis. Both Elasticsearch and Kibana, together with Logstash, are part of an analytics stack often referred to as ELK. Logstash supports data acquisition from multiple sources (including Twitter, RSS and event logs) thanks to its rich set of available connectors. Custom connectors can be developed for case-specific sources. In addition to the values mentioned, ELK enables building analytics infrastructures decoupled from the learning platform, i.e., it makes it possible to host the learning environment (with the analytics functionalities) and the data storage separately without affecting the end-user experience.
A talk about data workflow tools at Metrics Monday Helsinki.
Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either company.
Real Time Data Processing With Spark Streaming, Node.js and Redis with Visual... – Brandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Code examples available at https://github.com/br4nd0n/spark-streaming and https://github.com/br4nd0n/spark-viz
A demo and explanation of building a streaming application using Spark Streaming, Node.js and Redis with a real-time visualization. Includes a discussion of the internals of Spark and Spark Streaming, including RDD partitioning, code and data distribution, and cluster resource allocation.
Talk I did on log aggregation with the ELK stack at Leeds DevOps. Covers how we process over 800,000 logs per hour at laterooms, and the cultural changes this has helped drive.
Group of Airflow core committers talking about what's coming with Airflow 2.0!
Speakers: Ash Berlin-Taylor, Kaxil Naik, Kamil Breguła, Jarek Potiuk, Daniel Imberman and Tomasz Urbaszek.
Introduction to Streaming Distributed Processing with Storm – Brandon O'Brien
Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Introducing streaming data concepts, Storm cluster architecture and Storm topology architecture, and demonstrating a working example of a WordCount topology for the SIGKDD Seattle chapter meetup.
Presented by Brandon O'Brien
Code example: https://github.com/OpenDataMining/brandonobrien
Meetup: http://www.meetup.com/seattlesigkdd/events/222955114/
Lessons learned while taking Presto from alpha to production at Twitter. Presented at the Presto meetup at Facebook on 2015.03.22.
Video: https://www.facebook.com/prestodb/videos/531276353732033/
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model – Garindra Prahandono
Sale Stock Engineering, represented by Garindra Prahandono, presents "High-Velocity GraphQL & Lambda-based Software Development Model" at the BandungJS event on May 14th, 2018.
Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.
Slides for a talk presented at #ngpartycz about the history of APIs, their evolution, and how to pick the right technology for you. It might not be (and probably will not be) the best technical solution. And of course you will find out why GraphQL is actually REST.
What are the basic key points to focus on while learning Full-stack web devel... – kzayra69
Mastering full-stack web development with Django involves Python fundamentals, HTML/CSS/JavaScript, Django basics, database management, and deployment, with Django's template language simplifying dynamic content rendering and promoting code maintainability.
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB – MongoDB
MongoDB natively provides a rich analytics framework within the database. We will highlight the different tools, features and capabilities that MongoDB provides to enable various analytics scenarios, ranging from AI and Machine Learning to applications. We will demonstrate a Machine Learning (ML) example using MongoDB and Spark.
From Zero to Streaming Healthcare in Production (Alexander Kouznetsov, Invita... – confluent
Invitae is one of the fastest growing genetic information companies, whose mission is to bring comprehensive genetic information into mainstream medical practice to improve the quality of healthcare for billions of people. We have recently partnered with another lab, requiring an integration layer that was developed as part of a dizzying leap from a traditional Python service architecture to Scala Streaming applications on Kafka and Kubernetes. This presentation is our story, where we discuss challenges and solutions, error handling and resilience techniques, technology stack choices and compromises, tools and approaches we have developed, and general insights. Beyond engineering itself, our team's goal is enabling others to join in. Building an application entirely of Streams is a significant and in many ways liberating paradigm shift. In addition to learning to architect and understand how the application will behave and evolve, success depends on great tooling. We will show, for example, how we extended KStreams API to seamlessly include Avro Schema as part of our build and code infrastructure, completely automating SerDe derivation, introducing typed topics, and still supporting polyglot teams. Other highlights: - Self-healing streams with aggregation, and deciding when to crash - Connectors vs Streams for side effects - Scheduling with Streams - Deriving topology diagrams - Monitoring and metrics as Streams - Combining Avro, Swagger and code generation, plus avro4s vs avrohugger comparison - Typelevel Cats and its role in our success - http4s and hybrid testing
Adjusting primitives for graph: SHORT REPORT / NOTES – Subhajit Sahu
Graph algorithms like PageRank operate on graph representations such as Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... – Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
The Building Blocks of QuestDB, a Time Series Database – javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
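As a toy illustration of the first technique above (skipping computation on vertices that have already converged), here is a minimal sketch; the example graph, tolerance and naive dangling-node handling are illustrative only and unrelated to the STICD implementation referenced in the notes.

```python
def pagerank_skip_converged(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Toy PageRank that stops recomputing a vertex once its rank change falls
    below tol. This is the 'skip converged vertices' heuristic: a frozen vertex
    keeps its last rank even if its in-neighbours continue to move.
    Assumes every vertex appears as a key of `graph`; dangling nodes are handled naively."""
    vertices = list(graph)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    out_deg = {v: len(graph[v]) for v in vertices}
    in_links = {v: [] for v in vertices}          # reverse adjacency, built once
    for u, outs in graph.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in vertices:
            if v in converged:
                new_rank[v] = rank[v]             # skip the work for this vertex
                continue
            incoming = sum(rank[u] / out_deg[u] for u in in_links[v] if out_deg[u])
            new_rank[v] = (1.0 - damping) / n + damping * incoming
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)                  # stop recomputing it from now on
        rank = new_rank
        if len(converged) == n:
            break
    return rank


# Tiny example: a 3-vertex cycle, where every vertex ends up with rank 1/3.
print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]}))
```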
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
4. SaaS Customer Journey
[Diagram: the customer journey runs from START to FIRST VALUE and on to ONGOING VALUE; value grows through INCREASE USERS, INCREASE USAGE and EXPAND FUNCTIONALITY, or decreases and ends in CHURN.]
5. Customer Success Platform
● Analytics for SaaS companies
● Clear view of the customer journey
● Proactively prevent churn
● Increase upsell
● Track feature, module and total usage
● Health score based on usage patterns
● Improve conversion from trial to paying
9. About Totango
● Founded in 2010
● Size: ~50 (half R&D)
● Offices in Tel Aviv and San Mateo, CA
● 120+ customers
● ~70 million events per day
● ~1.5 billion indexed documents per month
● Hosted on Amazon Web Services
11. Terminology
● Service – Totango's customer (e.g. Zendesk)
● Account – the Service's (Zendesk's) customer
● SDR (Service Data Record) – a user activity event (e.g. user Joe from account Acme did activity Login in module Application)
12. SDR reception
● Clients send SDRs to the gateway, where they are collected, filtered, packaged and finally stored in S3 for daily/hourly batch processing.
● Realtime processing is also notified.
14. Account Data Flow
1) Raw Data (SDRs)
2) Account Aging (MySQL - legacy)
3) Activity Aggregations (Hadoop – legacy)
4) Metrics (Spark)
5) Health (Spark)
6) Alerts (Spark)
7) Indexing to Elasticsearch
15. Data Structure
● Account documents stored on Amazon S3
● Hierarchical directory structure per task parameter, e.g. /s-1234/prod/2015-04-27/account/metrics
● Documents have a predefined JSON schema; JSON is mapped directly to a Java document class
● Each file is an immutable collection of documents, one object per line – easily partitioned by lines
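As a tiny illustration of the one-object-per-line layout described above (the field names here are invented, not Totango's actual schema), each line is a complete JSON document, so a file can be split on line boundaries and every partition decoded independently:

```python
import json

# Two example account documents, one JSON object per line (fields are invented).
lines = [
    '{"account_id": "acme", "metrics": {"logins": 12}}',
    '{"account_id": "globex", "metrics": {"logins": 3}}',
]

# Because each line is self-contained, any subset of lines is a valid partition
# that can be decoded on its own, which is what makes line-based splitting easy.
for doc in (json.loads(line) for line in lines):
    print(doc["account_id"], doc["metrics"]["logins"])
```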
17. Resilient Distributed Datasets
● RDDs – distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way
● Initial RDD created from stable storage
● Programmer defines a transformation from an immutable input object to a new output object
● Transformation function class can (read: should!) be built and tested separately from Spark
18. Transformation flow
Read: inputRows = sparkContext.textFile(inputPath)
Decode: inputDocuments = inputRows.map(new JsonToAccountDocument())
Transform: docsWithHealth = inputDocuments.map(new AugmentDocumentWithHealth(healthCalcMetadata))
… other transformations may be done, all in memory …
Encode: outputRows = docsWithHealth.map(new AccountDocumentToJson())
Write: outputRows.saveAsTextFile(outputPath)
19. Examples (Java)
class AugmentDocumentWithHealth implements Function<AccountDocument, AccountDocument>
    AccountDocument call(final AccountDocument document) throws Exception { … return document with health … }

class AccountHealthToAlerts implements FlatMapFunction<AccountDocument, EventDocument>
    Iterable<EventDocument> call(final AccountDocument document) throws Exception { … generate alerts … }
20. Transformation function
● Passed as a parameter to a Spark transformation: map, reduce, filter, flatMap, mapPartitions
● Can (read: should!!) be checked in unit tests
● Serializable – sent to the Spark worker serialized
● Function must be idempotent!
● May be passed immutable metadata
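The deck's actual pipeline code is Java, as shown on slide 19 above; purely to make the read → map → write pattern and the "pure, idempotent function" idea concrete, here is a minimal sketch of the same shape in PySpark. The bucket, paths and the toy health rule are invented for illustration.

```python
import json
from pyspark import SparkContext


def augment_with_health(line):
    """Pure, idempotent transformation: one input record in, one output record out.
    No external state is touched, so the logic can be unit-tested without Spark."""
    doc = json.loads(line)
    doc["health"] = "good" if doc.get("logins", 0) > 10 else "poor"  # toy rule, not Totango's
    return json.dumps(doc)


if __name__ == "__main__":
    sc = SparkContext(appName="account-health-sketch")
    (sc.textFile("s3://example-bucket/s-1234/prod/2015-04-27/account/metrics")  # hypothetical input path
       .map(augment_with_health)                                                # intermediate work stays in memory
       .saveAsTextFile("s3://example-bucket/s-1234/prod/2015-04-27/account/health"))
```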
22. Why a workflow engine?
● Managing many ETL jobs
● Dependencies between jobs
● Continue pipeline from point of failure
● Separate workflow per service per date
● Overview and drill-down status Web UI
● Manual intervention
23. Workflow engines
● Azkaban, by LinkedIn (mostly for Hadoop)
● Oozie, by Apache (only for Hadoop)
● Amazon Simple Workflow Service (too generic)
● Amazon Data Pipeline (deeply tied to AWS)
● Luigi, by Spotify (customizable) – our choice!
24. What is Luigi
● Like a Makefile – but in Python, and for data
● Dependencies are managed directly in code
● Generic and easily extendable
● Visualization of task status and dependencies
● Command-line interface
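To make the Makefile analogy concrete, here is a minimal, hypothetical Luigi task chain in the spirit of the deck: each task declares what it depends on in requires(), what it produces in output(), and the code that runs. The task names, paths and the toy health rule are illustrative only, not Totango's actual pipeline; the (service, date) parameters mirror the "separate workflow per service per date" idea from slide 22.

```python
import json
import luigi


class Metrics(luigi.Task):
    service = luigi.Parameter()
    date = luigi.DateParameter()

    def output(self):
        # Luigi uses the existence of this target to decide whether the task must run.
        return luigi.LocalTarget(f"data/{self.service}/{self.date}/metrics.json")

    def run(self):
        with self.output().open("w") as out:
            json.dump({"logins": 12}, out)  # placeholder metric


class Health(luigi.Task):
    service = luigi.Parameter()
    date = luigi.DateParameter()

    def requires(self):
        # Dependency declared directly in code: Health runs only after Metrics is done.
        return Metrics(service=self.service, date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/{self.service}/{self.date}/health.json")

    def run(self):
        with self.input().open() as f:
            metrics = json.load(f)
        health = "good" if metrics.get("logins", 0) > 10 else "poor"  # toy rule
        with self.output().open("w") as out:
            json.dump({"health": health}, out)


if __name__ == "__main__":
    # Can also be run from the command line without the central scheduler, e.g.:
    #   python health_tasks.py Health --service s-1234 --date 2015-04-27 --local-scheduler
    luigi.run()
```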
36. Gameboy
● Totango-specific controller for Luigi
● Provides a high-level overview
● Enables manual re-runs of specific tasks
● Monitors progress, performance, run time, queue, worker load, etc.
40. Summary
● Typical data flow – from raw data to insights
● We use Spark for fast in-memory transformations; all code is in Java
● Our batch processing pipeline consists of a series of tasks, which are managed in Luigi
● We don't use all of Luigi's Python abilities, and we've added some new management abilities
In SaaS companies, customers don't buy a license; they pay for usage. They will stay and pay as long as they are satisfied, but at any point they can also leave suddenly and not renew the subscription.
To prevent churn, you need to understand the customer's state before they have made the decision to leave, because by then it is already too late.
Even for an existing customer who isn't thinking of leaving, you can help them make better use of the product, expand how they use it, grow their subscription, and so on.
Customer Success is the practice of managing the interaction with customers throughout every stage of their journey.
So what do we enable?
Analytics tools for companies that provide software as a service.
A clear view of the customer journey: from the stage where they trial the product, through purchasing a subscription, to how they derive value, and more.
Detecting a decline in usage before it turns into a decision to leave.
Upselling additional features the customer will derive value from.
Tracking usage at the level of a feature, a module, or overall.
A health score defined from product usage patterns.
Improving the conversion of trial customers into paying customers.
This is what our Health Console looks like.
You can see a summary of customers by segment, for example new customers, large customers, etc.
You can track usage of different modules in the system over time, broken down by various parameters.
We do dogfooding and use our own system on ourselves.
When we introduce a new feature, we measure how its different parts are used, to understand how people use it and whether there are parts they hesitate to use or don't understand.
The chart shows the overall summary, and you can click through to the detailed customer list and see who used what.
A bit about Totango:
Around 5 years old.
About 50 people, with development in Israel and sales abroad.
Tens of millions of events per day.
Billions of documents indexed every month.
All of the infrastructure runs on Amazon's cloud.
A bit about the architecture through which the raw usage data sent to our system becomes usage information about our customers' customers.
Service – our customer.
Account – a customer of our customer.
SDR – the data sent when a user at our customer's customer performs an action in our customer's system.
The way data flows into our system is fairly standard – it arrives from outside at a gateway server.
There it goes through initial processing, and is then packaged into files on S3.
I assume this is a typical usage pattern that many of you use on this kind of infrastructure, right?
Every day, and partly every hour, the incoming data goes through a processing pipeline.
The process is kicked off by Jenkins, a common scheduling system.
The process itself is managed by Luigi, which I will expand on later.
The data processing is done with Spark, and partly with legacy systems built on other tools.
Finally, the data is indexed in Elasticsearch and from there it is shown in Totango's user interface.
In short, the stages the data goes through in our system:
We start with raw data.
We drop accounts that are not active.
We aggregate the activities users performed.
We compute various metrics on the data.
We compute the health score for each account.
We generate alerts when there are changes that require attention.
And finally, as mentioned, all the data is stored in the index.
The data is stored in JSON format, which maps directly to the structures we work with in code.
It is stored in text files where each line is a JSON object.
The files are stored in a hierarchical directory structure on S3.
So how does the data actually get processed?
Spark is based on an abstraction of data processing.
Simply put: you load the data from the data source,
and define a series of transformations on the data, i.e. a series of instructions for how to turn the input into the output.
For example, this is our code that enriches an account object with its health score.
First we load the records from the file and convert the text lines into Java objects.
Then we call the map transformation and pass it a function.
This function, as we'll see shortly, needs to do one simple thing: take an input record and return an output record.
You can apply more transformations like this, and everything happens in memory – unlike Hadoop, for example, which must read from and write to disk at every stage.
Finally, the data is written back to disk, or to some other data destination.
For example:
A function that enriches an account with health receives and returns an account object.
A function that generates alerts receives an account and returns a list of alerts.
These functions are passed as parameters to Spark's transformations.
The function can and should be tested in a unit test; there is no need to spin up Spark to test the logic, and it can be done completely separately (in fact, that is what we did).
The function is sent to the data rather than the data to the function, so the function must be serializable.
The function must be idempotent – that is, it returns the same output given the same input, no matter how many times it runs – in short, it must not modify any external state!
You can pass the function some read-only metadata when it is created.
So we have many customers and many processing tasks – how do we manage all of this?
Why do we need a workflow management system at all?
We need to manage many tasks.
There are dependencies between the tasks.
Task failures are only a matter of time; we need to be able to continue from the point where the chain failed.
We need to clearly separate runs for different dates and different customers.
We need an easy view of task status, and the ability to drill into the details to understand why a task got stuck or failed.
We need to intervene manually when handling an incident or doing proactive work.
We evaluated a few options; some of you have probably heard of them.
A few people have already asked me about these.
We won't dwell on them.
After reviewing these systems and others, we chose Luigi.
So what is Luigi?
It is like a build file for software, except it is in Python and is for processing data rather than building code.
Dependencies are managed directly in code rather than in a configuration file.
It is generic and easily extendable.
You can see the status of the tasks and the dependencies between them visually.
There is also a command-line interface, which lets developers run a task directly with its parameters.
To define a task in Luigi, you need to define:
what its input is,
what its output is,
what it depends on,
and the code that actually executes when it runs.
Here is an example from Spotify:
The task depends on a previous task called Streams.
It opens a file, performs processing, and writes to the target file.
You can use other classes that already know how to do common things, and then you only need to configure the appropriate parameters for them.
For example, interfaces for hadoop, spark, elasticsearch, hive, etc.
You can pass parameters to a task, and they are mapped to a variable in the code.
You can easily run the task from the command line, even without the Luigi server, and see how it behaves.
You can easily see which tasks are currently running, what their parameters are, and what their status is.
You can also see a dependency graph that shows each task, what it depends on or what depends on it, and their current status.
For example, here you can see green tasks that have finished, a blue task that is currently running, and yellow tasks waiting for it.
If a task had failed, we would see it in red, and you could click the red circle to get details about the failure.
Here you see a graph with many more dependencies.
So how do we use Luigi at Totango?
First of all, all of our code is in Java.
We use Luigi to manage the flow and the dependencies, but the run itself actually calls our Java code and passes it the relevant parameters.
Luigi manages the tasks and the dependencies between them, but to kick off the run in the first place we use Jenkins.
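A minimal, hypothetical sketch of what such a wrapper can look like: a Luigi task whose run() shells out to the Java/Spark job with the service and date as parameters. The spark-submit arguments, jar and class names are invented; the deck does not show Totango's actual wrapper code.

```python
import subprocess

import luigi


class SparkHealthJob(luigi.Task):
    """Luigi handles dependencies and state; the real processing is a Java/Spark
    job invoked as an external process (all names below are illustrative)."""
    service = luigi.Parameter()
    date = luigi.DateParameter()

    def output(self):
        # A marker file indicating that this (service, date) run has completed.
        return luigi.LocalTarget(f"markers/{self.service}/{self.date}/health.done")

    def run(self):
        # Hypothetical spark-submit invocation; raises if the Java job fails,
        # so the marker is only written (and the task marked done) on success.
        subprocess.check_call([
            "spark-submit",
            "--class", "com.example.health.HealthJob",
            "health-job.jar",
            "--service", str(self.service),
            "--date", str(self.date),
        ])
        with self.output().open("w") as marker:
            marker.write("ok")
```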
We have an internal tool called Gameboy.
It gives us a better view of the information that matters to us about the processes running in the system.
It also lets us re-run specific processes with specific parameters.
This is what the main screen looks like.
You can see what is running right now and what its status is.
You can also see the status for a specific date.
Here you see charts of how many processes run at different hours across the day or the week.
You can also see which of our customers have already finished data processing for a given day.
You can pick a customer, pick a date, and run the batch process for it starting from a specific stage.
We do this when there are failures, or when we want to selectively refresh specific data.