This is the presentation we shared at the AWS Summit 2017 in Bangalore. We showcase our high-performance framework and the various components that enable an organization to be data driven. Find out how our components are engineered to scale, store data securely, and process data for insights.
UTAD - Jornadas de Informática - Potential of Big Data - Marco Silva
Short presentation given at the Universidade de Trás-os-Montes e Alto Douro (UTAD) IT event for students and faculty members. The talk is meant to be an overview of Big Data, how Microsoft technologies tackle that subject, and how students could leverage these tools in their projects and future careers.
CData Power BI Connectors - MS Business Application Summit - Jerod Johnson
The CData presentation introducing and demonstrating the CData Power BI Connectors (offering live connectivity to more than 100 SaaS, Big Data, and NoSQL sources).
Presentation on Data Mesh: a paradigm shift toward a modern, distributed ecosystem architecture that treats domain-specific data as a product, enabling each domain to handle its own data pipelines.
Everything generates logs. Applications, infrastructure, security ... everything. Keeping track of the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we will use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data, and configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
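The transformation step that Firehose hands off to Lambda can be sketched as follows. This is a minimal, hypothetical example (the `source` field it adds is an illustration, not something from the session); it follows the record shape Firehose passes to a transform Lambda: base64-encoded data in, a `recordId`/`result`/`data` triple out.

```python
import base64
import json

def handler(event, context):
    """Minimal Kinesis Firehose transform Lambda sketch: decode each
    record, tag it with a hypothetical 'source' field, re-encode it,
    and mark it 'Ok' so Firehose delivers it downstream."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "cloudwatch"  # hypothetical enrichment
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()
            ).decode(),
        })
    return {"records": output}
```

Records that fail to parse would instead be returned with `"result": "ProcessingFailed"` so Firehose can route them to the error prefix.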
Choosing the Right Database for My Workload: Purpose-Built Databases - AWS Germany
AWS offers a broad range of databases purpose-built for your specific application use cases. Our fully managed database services include relational databases for transactional applications, non-relational databases for internet-scale applications, a data warehouse for analytics, an in-memory data store for caching and real-time workloads, and a graph database for building applications with highly connected data. If you are looking to migrate your existing databases to AWS, the AWS Database Migration Service makes it easy and cost-effective to do so. The session will cover various SQL engines, “cloud-native SQL” (Aurora), SQL DWH + Spectrum, NoSQL, GraphDB.
4 slide overview of Microsoft BI & analytics architecture and how it would work with your current environment. See the PointDrive for more information - https://ptdrv.linkedin.com/5fj1ey0
Data Analytics Week at the San Francisco Loft
Uses of Data Lakes
Examples of using data lakes from different AWS customers.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Marie Yap - Enterprise Solutions Architect, AWS
This presentation is about Data Warehouse modernization: extending the warehouse into a modern data platform by adding a Big Data solution using EMR and Spark, and streaming data with Kinesis Firehose. It also covers the use case of a complementary data lake for the data warehouse, the ETL tool selection process, and ML considerations.
At Polestar, we hope to bring the power of data to organizations across industries, helping them analyze billions of data points and data sets to provide real-time insights, and enabling them to make critical decisions to grow their business.
Data Lakes: 8 Enterprise Data Management Requirements - SnapLogic
2016 is the year of the data lake. As you consider adopting an enterprise data lake strategy to manage more dynamic, poly-structured data, your data integration strategy must also evolve to handle new requirements. Thinking you can simply hire more developers to write code or rely on your legacy rows-and-columns centric tools is a recipe to sink in a data swamp instead of swimming in a data lake.
In this presentation, you'll learn about eight enterprise data management requirements that must be addressed in order to get maximum value from your big data technology investments.
To learn more, visit: https://www.snaplogic.com/big-data
Why HR Should Consider Agile Modern Data Delivery Platforms - syed_javed
A modern data delivery platform like Lyftron provides a universal data model capability to HR departments: it propagates changes from the source dynamically into the semantic layer, allowing enterprises to avoid manual semantic data model changes.
Power BI Dashboard | Microsoft Power BI Tutorial | Data Visualization | Edureka - Edureka!
This Edureka Power BI Dashboard Tutorial will take you through the step-by-step creation of a Power BI dashboard. It helps you learn the different functionalities in the Power BI tool with a demo on the superstore dataset. You will learn how to create a Power BI dashboard by extracting multiple insights from the superstore dataset and representing them visually.
Ubiquitous data does not always translate to actionable data, though most financial institutions have a treasure trove of data they are moving to the cloud and could be using today. The potential is huge, but most struggle just to make actionable data available, let alone turn it into business value at scale. This session will highlight some of the key use cases and technologies that provide the greatest returns and organizational impact.
Considerations for Data Access in the Lakehouse - Databricks
Organizations are increasingly exploring lakehouse architectures with Databricks to combine the best of data lakes and data warehouses. Databricks SQL Analytics introduces new innovation on the “house” to deliver data warehousing performance with the flexibility of data lakes. The lakehouse supports a diverse set of use cases and workloads that require distinct considerations for data access. On the lake side, tables with sensitive data require fine-grained access controls that are enforced across the raw data and derivative data products built via feature engineering or transformations. On the house side, tables can require fine-grained data access such as row-level segmentation for data sharing, plus additional transformations using analytics engineering tools. On the consumption side, there are additional considerations for managing access from popular BI tools such as Tableau, Power BI, or Looker.
The product team at Immuta, a Databricks partner, will share their experience building data access governance solutions for lakehouse architectures across different data lake and warehouse platforms to show how to set up data access for common scenarios for Databricks teams new to SQL Analytics.
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes, and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
Automate Business Insights on AWS - Simple, Fast, and Secure Analytics Platforms - Amazon Web Services
Business analysts require easy access to data from across different parts of the business. In this session, learn why more customers have adopted Amazon Redshift than any other cloud-native Data Warehouse, and how they are building a broader analytics capability with data lakes on AWS.
Understand how AWS built machine learning (ML) into the services, taking away many of the time-intensive tasks of building an analytics platform. We cover why these customers choose Amazon Redshift for the accessibility to analysts, business reporting, deep security, ability to scale from GB to PB, and integration with the broader platform.
Learn about these customers who are increasingly opening insights to data analysts for data discovery and data scientists for machine learning. We also share how AWS services such as AWS Glue and the coming ML-enabled AWS Lake Formation take away most of the heavy lifting.
In this slide, I have tried to explain what a data engineer does and the difference between a data engineer, a data analyst, and a data scientist.
What is OLAP - Data Warehouse Concepts - IT Online Training @ Newyorksys - NEWYORKSYS-IT SOLUTIONS
NEWYORKSYS TRAINING is dedicated to offering quality IT online training and comprehensive IT consulting services with a complete business service delivery orientation.
How a Semantic Layer Makes Data Mesh Work at Scale - DATAVERSITY
Data Mesh is a trending approach to building a decentralized data architecture by leveraging a domain-oriented, self-service design. However, the pure definition of Data Mesh lacks a center of excellence or central data team and doesn’t address the need for a common approach for sharing data products across teams. The semantic layer is emerging as a key component to supporting a Hub and Spoke style of organizing data teams by introducing data model sharing, collaboration, and distributed ownership controls.
This session will explain how data teams can define common models and definitions with a semantic layer to decentralize analytics product creation using a Hub and Spoke architecture.
Attend this session to learn about:
- The role of a Data Mesh in the modern cloud architecture.
- How a semantic layer can serve as the binding agent to support decentralization.
- How to drive self service with consistency and control.
Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. Using our cloud-based service you can easily connect to your data, perform advanced analysis, and create stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.
Big Data and Analytics on Amazon Web Services: Building A Business-Friendly P... - Amazon Web Services
If you are crafting a better customer experience, automating your business, or modernizing your systems, you are likely finding that your data and analytics platform is absolutely critical to your success. In this session, we will look at how customers are building on the managed services from Amazon Web Services to meet the needs of the business. Patterns we see gaining popularity include near-real-time engagement with customers over mobile, combining and analyzing unstructured consumer behavior with structured transactional data, and managing spiky data workloads. See how our customers use our managed, elastic, secure, and highly available services to change what is possible.
Craig Stires, Head of Big Data and Analytics, Amazon Web Services, APAC
Processing the volume and variety of data that today’s organizations produce can be both challenging and costly – especially with a legacy data warehouse. Combining the scale and performance of the cloud with AWS and APN Partner solutions for migration, integration, analysis, and visualization can help overcome these obstacles. With a modern data warehouse architecture, organizations can store, process, and analyze massive volumes of data of virtually any type. Register for this upcoming webinar, where Pearson - an education and media conglomerate - will share in detail how they built a scalable and flexible business intelligence platform on the cloud, with Tableau and AWS.
Learn how you can seamlessly load and transform data in Amazon Redshift with Matillion ETL and analyze it with Tableau. Hear how 47Lining and NorthBay can provide insights to guide you through migration with ease. Tableau will discuss best practices to analyze your data on AWS and share new insights throughout your organization.
Kyvos Insights is unlocking the power of Big Data analytics with “OLAP on Hadoop” technology.
Kyvos is a solution which brings a new model of online analytical processing (OLAP) to Big Data that allows users to visually create and analyze cubes on Hadoop. This technology enables users to easily derive valuable insights for better, more informed business decisions through previously unattainable levels of scalability and interactivity.
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Notes on primitives for graph algorithms, such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
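The "skip computation on converged vertices" idea above can be sketched in plain Python. This is a simplified illustration, not the STICD implementation; it tracks per-vertex convergence and assumes the input graph has no dangling nodes.

```python
def pagerank_skip_converged(adj, d=0.85, eps=1e-8, max_iter=100):
    """PageRank sketch that skips recomputation for vertices whose rank
    has already converged. 'adj' maps node -> list of out-neighbours;
    the graph is assumed to have no dangling nodes."""
    n = len(adj)
    ranks = {v: 1.0 / n for v in adj}
    converged = {v: False for v in adj}
    # Build the reverse adjacency (in-edges) once, up front.
    in_edges = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            in_edges[v].append(u)
    for _ in range(max_iter):
        if all(converged.values()):
            break
        new = {}
        for v in adj:
            if converged[v]:
                new[v] = ranks[v]  # skip the work for converged vertices
                continue
            contrib = sum(ranks[u] / len(adj[u]) for u in in_edges[v])
            new[v] = (1.0 - d) / n + d * contrib
            if abs(new[v] - ranks[v]) < eps:
                converged[v] = True
        ranks = new
    return ranks
```

Note the trade-off the report describes: a converged vertex's in-neighbours may still change, so per-vertex skipping trades a little accuracy for iteration time.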
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Architecting Data Lake on AWS by the Data Engineering Team at HiFX IT
1. Architecting Data Lakes on AWS with HiFX
Established in the year 2001, HiFX is an Amazon Web Services Consulting Partner that has been designing and migrating applications and workloads in the cloud since 2010. We have been helping organisations to become truly data driven by building data lakes in AWS since 2015.
2. The Challenges
- Lack of agility and accessibility for data analysis that would aid the product team in making smart business decisions and improving strategies.
- Increasing volume and velocity of data. With new digital properties getting added, there was a need to design collection and storage layers that would scale well.
- Dozens of independently managed collections of data, leading to data silos. Having no single source of truth was leading to difficulties in identifying what type of data is available, getting access to it, and integration.
- Poorly recorded data. Often, the meaning and granularity of the data was getting lost in processing.
3. Our Journey from Data to Decisions with an AWS-powered Data Lake
- Connecting dozens of data streams and repositories to a unified data pipeline, enabling near-realtime access to any data source.
- Engineering well-designed big data stores for reporting and exploratory analysis.
- Architecting a secure, well-governed data lake to store all data in a raw format. S3 is the fabric with which we have woven the solution.
- Processing data in streams or batches to aid analytics and machine learning, supplemented by smart workflow management to orchestrate the tasks.
- Dynamic dashboards and visualisations that make data tell stories and help drive insights.
- Offering recommendations and predictive analytics off the data in the data lake.
4. COLLECT STORE PROCESS CONSUME
Scribe (Collector)
Scribe collects data from the trackers and writes them to Kinesis Streams. It is written in Go and engineered for high concurrency, low latency and horizontal scalability. Currently running on two c4.large instances, our API latency at the 50th percentile is 12.6 ms and at the 75th percentile is 36 ms. This is made possible by the consistent and predictable performance of Kinesis.

Accumulo (Storage)
Accumulo is the data consumer component responsible for reading data from the event streams (Kinesis Streams), performing rudimentary data quality checks and converting data to Avro format before loading it to the Data Lake. Our Data Lake in S3 captures and stores raw data at scale for a low cost. It allows us to store many types of data in the same repository while letting us define the structure of the data at the time it is used.
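The deck says the collector is written in Go; purely as an illustration (in Python, with hypothetical field and stream names), the sketch below shows the shape of batching tracker events into Kinesis `PutRecords` requests. Partitioning on a stable key keeps each user's events ordered within a shard, and Kinesis caps a batch at 500 records.

```python
import json

def build_put_records_batches(events, key_field="user_id"):
    """Build Kinesis PutRecords batches from a list of event dicts.
    'user_id' as the partition-key field is a hypothetical choice;
    batches are chunked to the 500-record PutRecords limit."""
    records = [
        {"Data": (json.dumps(e) + "\n").encode(),
         "PartitionKey": str(e.get(key_field, "anonymous"))}
        for e in events
    ]
    return [records[i:i + 500] for i in range(0, len(records), 500)]

# Sending the batches (requires AWS credentials), sketched only:
# import boto3
# kinesis = boto3.client("kinesis")
# for batch in build_put_records_batches(events):
#     kinesis.put_records(StreamName="tracker-events", Records=batch)
```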
6. Why Amazon S3 For Data Lake?
- Performance relatively lower than an HDFS cluster, but this doesn't affect our workloads significantly. EMRFS with consistent view (backed by DynamoDB) works really well.
- Native support for versioning, tiered storage (Standard, IA, Amazon Glacier) via life-cycle policies, and security: SSL, client/server-side encryption.
- Unlimited number of objects and volume of data, along with 99.99% availability and 99.999999999% durability. Lower TCO and easier to scale than HDFS.
- Decoupled storage and compute, allowing multiple heterogeneous analysis clusters to use the same data.
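The tiered-storage point above maps onto an S3 lifecycle configuration. A minimal sketch follows; the prefix, day thresholds, and bucket name are hypothetical examples, not values from the deck.

```python
def data_lake_lifecycle(prefix="raw/"):
    """S3 lifecycle configuration sketch: transition objects under a
    prefix to Standard-IA after 30 days and to Glacier after 365 days
    (thresholds are illustrative)."""
    return {
        "Rules": [{
            "ID": "tier-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    }

# Applying it (requires AWS credentials), sketched only:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",
#     LifecycleConfiguration=data_lake_lifecycle())
```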
7. Prism (Processor)
Prism is a unified processing engine using Apache Spark running on EMR, written in Scala. Airflow is used to programmatically author, schedule and monitor workflows. Prism generates data for tracking KPIs and performs funnel, pathflow, retention and affinity analysis. It also includes machine learning workloads that generate recommendations and predictions.

Lens (Consumer)
Lens is a custom-built reporting and visualisation app that helps business owners easily interpret, visualise and record data and derive insights. It offers detailed analysis of KPIs, event segmentation, funnels, search insights, path finder, and retention/addiction analysis, powered by Redshift and Druid, using Pgpool to cache Redshift queries.
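The processing stages the Editor's Notes mention (quality checks, cleansing, and enrichment) can be sketched, independently of Spark, as record-level functions that a Spark job would map over. Everything here is illustrative: the field names and the geo lookup are hypothetical, not part of the HiFX pipeline.

```python
def is_valid(event):
    """Rudimentary quality check: required fields must be present."""
    return all(k in event for k in ("event_type", "timestamp", "user_id"))

def cleanse(event):
    """Normalise fields, e.g. trim and lowercase the event type."""
    out = dict(event)
    out["event_type"] = out["event_type"].strip().lower()
    return out

def enrich(event, geo_lookup):
    """Enrich with a (hypothetical) geo lookup keyed by user_id."""
    out = dict(event)
    out["country"] = geo_lookup.get(out["user_id"], "unknown")
    return out

def process(events, geo_lookup):
    """The filter/map pipeline a Spark job would express over an RDD
    or DataFrame: drop invalid records, cleanse, then enrich."""
    return [enrich(cleanse(e), geo_lookup) for e in events if is_valid(e)]
```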
9. KPIs
- Product relationship: understand which products are viewed consecutively.
- Product affinity: understand which products are purchased together.
- Sales: hourly, daily, weekly, monthly, quarterly, and annual.
- Average market basket: average order size.
- Cart abandonment rate: shopping cart abandonment rate.
- Days / visits to purchase: the average number of days and sessions from the first website interaction to purchase.
- Cost per acquisition: (total cost of marketing activities) / (# of conversions).
- Repeat purchase rate: what % of our customers are repeat customers.
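Two of the KPIs above come with explicit definitions (cost per acquisition and repeat purchase rate); a small sketch of computing them follows. The input shapes and sample numbers are illustrative.

```python
def cost_per_acquisition(total_marketing_cost, conversions):
    """CPA = (total cost of marketing activities) / (# of conversions),
    as defined on the slide."""
    return total_marketing_cost / conversions

def repeat_purchase_rate(orders_by_customer):
    """Share of customers with more than one order, given a mapping of
    customer id -> order count (a common definition of the KPI)."""
    repeat = sum(1 for n in orders_by_customer.values() if n > 1)
    return repeat / len(orders_by_customer)

# Illustrative numbers:
cpa = cost_per_acquisition(5000.0, 200)   # 25.0 per conversion
rate = repeat_purchase_rate({"a": 3, "b": 1, "c": 2, "d": 1})  # 0.5
```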
10. Product page performance
Measuring product performance
The scatter plot compares the number of unique users that view each product with the number of unique users that add the product to basket, with the size of each dot being the number of unique users that buy the product. Any products located in the lower right corner are highly trafficked but low converting; any effort spent fixing those product pages (e.g. by checking the copy, updating the product images or lowering the price) should be rewarded with a significant sales uplift, given the number of people visiting those pages.
11. Measuring product performance
In contrast, products located in the top left of the plot are very highly converting, but low-trafficked pages. We should drive more traffic to these pages, either by positioning those products more prominently on catalog pages, for example, or by spending marketing dollars driving more traffic to those pages specifically. Again, that investment should result in a significant uplift in sales, given how highly converting those products are. Similarly, products in the lower left corner are performing poorly, but it is not clear whether this is because they have low traffic levels and/or are poor at driving conversions. We should invest in improving the performance of these pages, but the return on that investment is likely to be smaller (or harder to achieve) than the other two opportunities.
12. Identifying products / content that go well together
Market basket analysis is an association rule learning technique aimed at uncovering the associations and connections between specific products in our store. In a market basket analysis, we look to see if there are combinations of products that frequently co-occur in transactions.
We can use this type of analysis to:
• Inform the placement of content items on sites, or of products in the catalogue
• Drive recommendation engines (like Amazon’s “customers who bought this product also bought these products…”)
• Deliver targeted marketing (e.g. emailing customers who bought specific products with offers on other products that are likely to be interesting to them)
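The pair co-occurrence counting at the heart of market basket analysis can be sketched in a few lines. This is a toy version over in-memory baskets; a real pipeline would compute support, confidence, and lift over far larger transaction data.

```python
from collections import Counter
from itertools import combinations

def pair_supports(transactions):
    """Count how often each unordered product pair co-occurs in a
    basket, and return support = co-occurrence / number of baskets."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) gives each unordered pair a canonical key
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["eggs", "milk"],
    ["bread", "eggs"],
]
supports = pair_supports(baskets)
# ("bread", "milk") co-occurs in 2 of 4 baskets -> support 0.5
```

Pairs whose support clears a threshold would then feed placement decisions, recommendations, or targeted offers as the slide lists.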
Editor's Notes
Trackers allow us to collect data from any type of application (web, mobile), service or device.
All trackers adhere to the predefined Tracker Protocol.
They send data asynchronously, and hence do not affect application performance.
Collectors are stateless and horizontally scalable.
Each shard in a Kinesis stream can support reads of up to 2 MB per second and writes of up to 1,000 records / 1 MB per second.
Scribe and Accumulo automatically detect new shards and scale.
Accumulo is a KCL Java app that buffers the events and uploads the batches as Avro files to the Data Lake.
DAGs in Airflow pull dimension and offline data and load them to the Data Lake.
Streaming workloads serve near-realtime reports (news) and batch workloads serve daily reports (classifieds).
EMR with instance fleets provides a cost-effective way to process data.
Data processing involves quality checks, cleansing, reconciling and enrichment.
A subset of the data (sans page views and data beyond the last 2 years) is sent to Druid & Redshift.
All data (historical) is stored as Parquet in S3 with a lifecycle policy; Athena can point to this data for ad hoc analysis.
Druid is used for realtime data and aggregate queries that do not require joins; Redshift for everything else.
Lens is built with React using the nvd3 charting library, built for multi-tenancy with fine-grained ACLs, with APIs powered by Go.
Recommendations are powered by DynamoDB (predictable performance and no need to sort on multiple fields).