1. Evolving a premium raw data product
from a simple Spark script in 3 months
Avi Perez, Big Data Team Leader @AppsFlyer
2. AppsFlyer
• $28M raised from top VCs
• 200M → 13B daily events [3 years]
• 40GB → 5TB (gz) daily text data
• 25 → 60 people in R&D during 2016
• Top 15 Israeli startups by inc.com
3. What We Do
Media Sources → App Developers → App Users
$4B in media payments measured annually
4. AppsFlyer Raw Data Channels
Raw vs Aggregated
• Real-time stream from Kafka
• Online query data API (CSV)
HTTP | Columnar DB | S3
5. New Use Case
• Big Clients with BI Systems
• Very large files / large number of files
• Tackling current limitations
12. Improving Data Format
• Scanning a lot of data is easy... but not that fast
• Being a big data company doesn't necessarily mean you can read all your data fast
13. Moving to Parquet . . .
(developed by Twitter & Cloudera)
• Columnar storage (load only the columns you need)
• Space efficient (50% improvement)
• Read-time efficient (98% improvement)
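Why a columnar format like Parquet reads less data can be sketched in plain Python. This is a toy illustration only, not Parquet itself; the data and field names are made up.

```python
# Toy illustration of row-oriented vs column-oriented layouts (not
# Parquet itself): a query that needs one column touches far fewer
# fields in the columnar layout. All names and values are hypothetical.

rows = [
    {"app_id": "com.example", "event": "install", "cost": 1.2},
    {"app_id": "com.example", "event": "click",   "cost": 0.1},
    {"app_id": "org.demo",    "event": "install", "cost": 1.5},
]

# Row layout: answering "sum of cost" scans every field of every record.
row_scan_fields = sum(len(r) for r in rows)   # 9 fields read

# Columnar layout: the same data pivoted into one array per column.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# The same query now touches only the "cost" column.
col_scan_fields = len(columns["cost"])        # 3 fields read

total_cost = sum(columns["cost"])
print(total_cost, row_scan_fields, col_scan_fields)
```

With only 3 of 9 fields scanned, the toy query reads a third of the data; on wide event tables with dozens of columns the pruning is far more dramatic, which is where the read-time improvement comes from.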
16. From script to Micro Service
• Task creation (buckets, IAM, credentials, etc.)
• Search on task executions
• Access to the report files
• Get job statuses over HTTP
• Highly available
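The service responsibilities above can be sketched as a minimal in-memory task registry. The names here are hypothetical; the real service exposes these operations over HTTP and persists state.

```python
import uuid
from dataclasses import dataclass, field

# Minimal sketch of the report-task service (hypothetical names; the
# real service is an HTTP microservice with durable storage).

@dataclass
class Task:
    task_id: str
    client: str
    status: str = "pending"   # pending -> running -> done / failed
    report_files: list = field(default_factory=list)

class TaskRegistry:
    def __init__(self):
        self._tasks = {}

    def create(self, client):
        """Task creation (bucket/IAM/credential provisioning hooks in here)."""
        task = Task(task_id=str(uuid.uuid4()), client=client)
        self._tasks[task.task_id] = task
        return task.task_id

    def status(self, task_id):
        """Get the status of a job."""
        return self._tasks[task_id].status

    def search(self, client):
        """Search task executions by client."""
        return [t for t in self._tasks.values() if t.client == client]

registry = TaskRegistry()
tid = registry.create("big-client")
print(registry.status(tid))                 # pending
print(len(registry.search("big-client")))   # 1
```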
24. Vault
• Secure Secret Storage
• Dynamic Secrets
• Data Encryption
• Leasing and renewal
• Revocation
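The dynamic-secrets model (credentials generated on demand, valid for a lease period, renewable, and revocable) can be sketched conceptually. This is NOT Vault's API, only an illustration of the lease/renew/revoke semantics.

```python
import time
import secrets

# Conceptual sketch of Vault-style dynamic secrets with leases.
# (Not Vault's API; it only illustrates lease / renew / revoke semantics.)

class Lease:
    def __init__(self, ttl_seconds):
        self.secret = secrets.token_hex(16)   # generated per request, per client
        self.expires_at = time.time() + ttl_seconds
        self.revoked = False

    def valid(self):
        return not self.revoked and time.time() < self.expires_at

    def renew(self, ttl_seconds):
        """Extend the lease before it expires."""
        self.expires_at = time.time() + ttl_seconds

    def revoke(self):
        """Immediately invalidate the credential."""
        self.revoked = True

lease = Lease(ttl_seconds=60)
assert lease.valid()
lease.renew(ttl_seconds=120)
lease.revoke()
print(lease.valid())   # False
```

Because every client gets its own short-lived secret, a leaked credential expires on its own and can be revoked without rotating anything shared.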
25. Cost Optimization
Helping our clients with downloads
Daily sessions output file for one of the clients: 60GB
The same report compressed (.gz): 2.1GB
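A roughly 28x reduction like the one above is plausible because session logs are highly repetitive text. A stdlib sketch with synthetic CSV-like rows (the sample data and its ratio are made up, not the client report):

```python
import gzip

# Stdlib sketch: gzip on repetitive text (like a sessions CSV) shrinks
# it dramatically. The rows below are synthetic; the 60GB -> 2.1GB
# figure on the slide is from a real client report.

row = "2016-05-01,com.example.app,session,US,android,1.0\n"
data = (row * 100_000).encode()   # a few MB of repetitive rows

compressed = gzip.compress(data)
ratio = len(data) / len(compressed)
print(f"{len(data)} -> {len(compressed)} bytes ({ratio:.0f}x smaller)")
```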
Sales came to R&D and asked for a way to get organic data
Big data analytics
We give our users tools to measure the quality of the traffic they bring from different advertising channels,
to see where that quality traffic comes from,
and to make decisions from huge amounts of data that directly affect their revenue
Raw vs aggregate
Not always using our dashboard
We asked them what we do with our APIs
ETL jobs to run on S3
High load on AF systems
How we can solve
Many queries per day
We have an inherent limit of 200k rows
CMS big clients, remove limitations. Very large companies want all their data
A script “issue” that cost us $50k
Solution: we had to make hard decisions in R&D, knowing we would pay for it in manual maintenance, but there was really no choice and we wanted this strategic client.
And this is the solution we presented:
13B events → Kafka → Secor (a service for persisting Kafka logs to S3)
as sequence files
SparkSQL on top of that
Manually creating a bucket on our production S3 for that account with List/Read-only permissions, manually creating a specific IAM user, and providing them the credentials
And running the process each morning with cron / Mesos
We went to production within a few days, and allowed access only to the smallest topic, installs. The client signed.