At Zillow, we calculate a Zestimate® home value for about 100 million homes nationwide daily. But between batch runs, users could update their home facts or even list their home on the market. Housing markets move fast, and we want Zestimates to reflect the latest state of our housing data. In this talk, I will present the architecture of the Zestimate and the infrastructure powering it. Inspired by Lambda Architecture, the Zestimate relies on both a near real-time and a batch component. I will highlight how the design allows us to be nimble in the face of data changes, while not sacrificing algorithmic accuracy during daily batch runs.
Zestimate Lambda Architecture
1. ZESTIMATE + LAMBDA ARCHITECTURE
Steven Hoelscher, Machine Learning Engineer
How we produce low-latency, high-quality home estimates
2. Goals of the Zestimate
• Independent
• Transparent
• High Accuracy
• Low Bias
• Stable over time
• Respond quickly to data updates
• High coverage (about 100M homes)
www.zillow.com/zestimate
3. In early 2015, we shared the original architecture of the Zestimate…
…but a lot has changed
4. Then (2015)
• Languages: R and Python
• Data Storage: on-prem RDBMSs
• Compute: on-prem hosts
• Framework: in-house parallelization library (ZPL)
• People: Data Analysts and Scientists

Now (2017)
• Languages: Python and R
• Data Storage: AWS Simple Storage Service (S3), Redis
• Compute: AWS Elastic MapReduce (EMR)
• Framework: Apache Spark
• People: Data Analysts, Scientists, and Engineers

So, what’s changed?
5. Lambda Architecture
• Introduced by Nathan Marz (Apache Storm) and highlighted in his book, Big Data (2015)
• An architecture for scalable, fault-tolerant, low-latency big data systems
[Diagram: tradeoff between "Low Latency, Accuracy" and "High Latency, Accuracy"]
7. High-level Lambda Architecture
• We can process new data with only a batch layer, but for computationally expensive queries, the results will be out-of-date
• The speed layer compensates for this lack of timeliness by computing, generally, approximate views
9. Data is immutable
Below, we see the evolution of a home over time:
• Constructed in 2010 with 2 bedrooms and 1 bath
• A full bath added five years later, increasing the square footage
• Finally, another bedroom is added as well as a half-bath

PropertyId  Bedrooms  Bathrooms  SquareFootage  UpdateDate
1           2.0       1.0        1450           2010-03-13
1           2.0       2.0        1500           2015-05-15
1           3.0       2.5        1800           2016-06-24
10. Data is eternally true

PropertyId  Bathrooms  UpdateTime
1           2.0        2015-05-15
1           2.5        2016-06-24

PropertyId  SaleValue  SaleTime
1           450000     2015-08-19

• The 2.0-bathroom value would have been overwritten in our mutable data view
• The 2015 sale transaction in our training data would erroneously use a bathroom upgrade from the future
12. Batch Layer Highlights
ETL
• Ingests master data
• Standardizes data across many sources
• Dedupes, cleanses and performs sanity checks on data
• Stores partitioned training and scoring sets in Parquet format
Train
• Large memory requirements (caching training sets for various models)
Score
• Scoring set partitioned in uniform chunks for parallelization
13. Responding to data changes quickly
• The number one source of Zestimate error is the facts that flow into it: about bedrooms, bathrooms, and square footage
• To combat data issues, we give homeowners the ability to update such facts and immediately see a change to their Zestimate
• Beyond that, we want to recalculate Zestimates when homes are listed on the market
14. Speed Layer Architecture: Kinesis Consumer
• The Kinesis consumer is responsible for low-latency transformations to the data
• Much of the data cleansing in the batch layer relies on a longitudinal view of the data, so we cannot afford these computations in the speed layer
• It looks up pertinent property information in Redis and decides whether to update the Zestimate by calling the API
15. Speed Layer Architecture: Zestimate API
• Uses latest, pre-trained models from the batch layer to avoid costly retraining
• All property information required for scoring is stored in Redis, reusing a majority of the exact calculations from the batch layer
• Relies on sharding of pre-trained region models due to individual model memory requirements
16. Remember: Eventual Accuracy
• The speed layer is not meant to be perfect; it’s meant to be lightning fast. Your batch layer will correct mistakes, eventually.
• As a result, we can think of the speed layer view as ephemeral

Toy Example: Square feet or acres?
Imagine a GIS model for validating lot size by looking at a given property’s parcel and its neighboring parcels. But what happens if that model is slow to compute?

PropertyId  LotSize
0           21
1           16
2           5
17. Serving Layer Architecture
• We still rely on our on-prem SQL Server for serving Zestimates on Zillow.com
• Reconciliation of views requires knowing when the batch layer started: if a home fact comes in after the batch layer began, we serve the speed layer’s calculation
18. The Big Picture
(1) Data is immutable and human-fault tolerant
(2) Performs heavy-lifting cleaning and training
(3) Reduces latency and improves timeliness
(4) Reconciles views to ensure the better estimate is chosen
19. SO DID YOU FIX MY ZESTIMATE?
Andrew Martin, Zestimate Research Manager
20. Accuracy Metrics for Real-Estate Valuation
• Median Absolute Percent Error (MAPE)
• Measures the “average” amount of error in prediction, in terms of percentage off the correct answer in either direction
• Measuring error in percentages is more natural for home prices, since they are heteroscedastic
• Percent Error Within 5%, 10%, 20%
• Measure of how many predictions fell within +/-X% of the true value

$$\mathrm{MAPE} = \operatorname{Median}\left(\frac{\lvert \mathrm{SalePrice} - \mathrm{Zestimate} \rvert}{\mathrm{SalePrice}}\right)$$

$$\mathrm{Within}\ X\% = \frac{1}{\lvert \mathrm{Sales} \rvert} \sum_{i \in \mathrm{Sales}} \mathbf{1}\!\left[\frac{\lvert \mathrm{SalePrice}_i - \mathrm{Zestimate}_i \rvert}{\mathrm{SalePrice}_i} < X\%\right]$$
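A minimal sketch of how these two metrics could be computed, assuming a simple table of sales with columns named sale_price and zestimate (illustrative names, not Zillow's actual schema):

```python
import pandas as pd

def accuracy_metrics(df: pd.DataFrame, thresholds=(0.05, 0.10, 0.20)) -> dict:
    """Compute MAPE and Within-X% for a frame of (sale_price, zestimate) pairs."""
    abs_pct_error = (df["sale_price"] - df["zestimate"]).abs() / df["sale_price"]
    metrics = {"MAPE": abs_pct_error.median()}
    for x in thresholds:
        metrics[f"within_{int(x * 100)}pct"] = (abs_pct_error < x).mean()
    return metrics

# Toy data for illustration only.
sales = pd.DataFrame({
    "sale_price": [300_000, 450_000, 525_000],
    "zestimate": [310_000, 440_000, 610_000],
})
print(accuracy_metrics(sales))
```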
21. Did you know we keep a public scorecard?
www.zillow.com/zestimate/
22. Comparing Accuracy at 10,000 ft
• Let’s focus on King County, WA, since the new architecture has been live here since January 2017
• We compute accuracy by using the Zestimate at the end of the month prior to when a home was sold as our prediction
• i.e., if a home sold in Kent for $300,000 on April 10th, we’d use the Zestimate from March 31st
• We went back and recomputed Zestimates at month ends with the new architecture for all homes and months in 2016
• We compare architectures by looking at error on the same set of sales

Architecture  MAPE  Within 5%  Within 10%  Within 20%
2015 (Z5.4)   5.1%  49.0%      75.0%       92.5%
2017 (Z6)     4.5%  54.1%      81.0%       94.9%
24. Breaking Accuracy out by Home Type

Architecture  Home Type  MAPE  Within 5%  Within 10%  Within 20%
2015 (Z5.4)   SFR        5.1%  49.2%      74.8%       92.4%
2015 (Z5.4)   Condo      5.1%  49.5%      76.8%       93.7%
2017 (Z6)     SFR        4.5%  54.6%      81.1%       94.6%
2017 (Z6)     Condo      4.6%  53.4%      81.6%       96.0%
25. Think that you might have an idea for how to improve the Zestimate? We’re all ears...
www.zillow.com/promo/zillow-prize
26. We are hiring!
• Data Scientist
• Machine Learning Engineer
• Data Scientist, Computer Vision and Deep Learning
• Software Development Engineer, Computer Vision
• Economist
• Data Analyst
www.zillow.com/jobs
Editor's Notes
Hi everyone, thanks for joining me here at Zillow for today’s meet up. My name is Steven Hoelscher, and I’m a machine learning engineer on the data science and engineering team.
I’ve been with Zillow for 2.5 years now and had the opportunity to work on the team responsible for building and rearchitecting a new Zestimate pipeline, largely inspired by Lambda Architecture.
It’s my hope that you’ll walk away from this presentation with a better understanding of what Lambda Architecture means and will have seen an in-production example of how to actually realize it.
Without further ado, let’s start with the Zestimate itself and its goals. For those who aren’t familiar, the Zestimate is simply our estimated market value for individual homes nationwide. We strive to put a Zestimate on every rooftop, just as we see in this screenshot.
Every day, the Zestimate team thinks about how we can improve our algorithm, and from a data science perspective, improvement is based on whether we achieve these goals. To talk about a few: obviously, we would like our Zestimates to have high accuracy; when a home sells, it’s our goal for the Zestimate to be near that sale price. The Zestimate, as an algorithm, should also be stable over time and not exhibit erratic behavior day-to-day. The Zestimate should also be able to respond quickly to data updates. Users can supply us with more accurate data to improve our estimates, and their Zestimate should immediately reflect fact updates.
In a sense, these are the goals that our pipeline must support and we’re going to spend some more time talking about how to balance these goals in a big data system.
In early 2015, right around the time I started at Zillow, a few of my colleagues presented on the Zestimate architecture…as it was then. But a lot has changed since that presentation, just two years ago.
At the core, the Zestimate in 2015 was largely written in R. Our team was composed of R language experts, and we even built an in-house R framework for parallelization a la MapReduce. We were a smaller team back then, mostly data scientists who also had a knack for engineering. We relied on collaboration with other teams, especially our database administrators, to interface with on-premises relational databases.
Two years later, we’ve made a hiring push across all skill sets and invited engineers to join the fray. Python has become the new language of choice, thanks mostly to its long history of support in Apache Spark. We started leveraging more and more cloud-based services, such as Amazon’s Simple Storage Service for storing our data and Elastic MapReduce for compute. No longer are we bottlenecked by the size of a single machine.
With all of these changes, we had the opportunity to start afresh and design a system that would handle large amounts of data in the cloud, that would rely on horizontal scaling, and most importantly would meet the goals of the Zestimate.
Enter Lambda Architecture. The idea of Lambda Architecture was introduced by Nathan Marz, the creator of Apache Storm. I highly recommend the book he published in 2015 with the title *Big Data*. For the uninitiated, this book provides the foundations of Lambda Architecture, with great case studies for understanding how to achieve this architecture.
Simply put, Lambda Architecture is a generic data processing architecture that is horizontally scalable, fault-tolerant (in the face of both human and hardware failures), and capable of low latency responses.
Shortly, we’ll see what a high-level lambda architecture looks like. But before we dive into that, I want to talk about making a tradeoff between latency and accuracy. In some cases, we cannot expect to have low-latency responses when dealing with enormous amounts of data. As such, we have to trade off some degree of accuracy to reduce our latency. This idea underpins Lambda Architecture.
Let’s look at an example, highlighted by the Databricks team. Apache Spark implements an algorithm for calculating approximate percentiles of numerical data, with a function called approxQuantile. This algorithm requires the user to specify a target error bound, and the result is guaranteed to be within this bound. The algorithm can be adjusted to trade accuracy against computation time and memory.
In the example here, the Databricks team studies the length of the text in each Amazon review. On the x-axis, we have the targeted residual. As we would guess, the higher the residual, the less computationally expensive our calculation becomes, but the tradeoff is accuracy.
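A minimal sketch of that tradeoff using PySpark's approxQuantile, with synthetic review lengths standing in for the Amazon review data from the Databricks study; a relativeError of 0.0 requests an exact answer, while larger values buy speed and memory at the cost of accuracy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("approx-quantile-demo").getOrCreate()

# Illustrative data: one row per review with its text length.
lengths = spark.createDataFrame(
    [(float(n),) for n in range(1, 10001)], ["review_length"]
)

# relativeError = 0.0 is exact (and expensive); 0.1 is approximate but cheap.
exact = lengths.approxQuantile("review_length", [0.5, 0.95], 0.0)
approx = lengths.approxQuantile("review_length", [0.5, 0.95], 0.1)
print("exact:", exact, "approx:", approx)
```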
Let’s start thinking about what this means for a big data processing system. We could start simply by building a batch system with low complexity. It reads directly from a master dataset that contains all of the data so far. This batch layer, as it’s called, will virtually freeze the data at the time the job begins and start running computations. The problem is that once the batch layer finishes computing a query, the data is already out-of-date: new changes have come in and were not accounted for.
This is the gap that the Lambda Architecture is trying to solve. We can rely on a speed layer that will compensate for the batch layer’s lack of timeliness. But the speed layer, generally speaking, cannot rely on the same algorithms that the batch layer uses. In the example before, we would want our batch layer to calculate a correct and highly accurate quantile, but the speed layer should rely on approximation to be more nimble.
In this way, at any given moment, we could have two different views: one view from the batch layer that is accurate but not so timely and one view from the speed layer that is less accurate but timely. Reconciling these two views, we can answer a query in a relatively accurate and timely fashion.
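A minimal sketch of how reconciling the two views could look, assuming simple dictionary-backed batch and speed views keyed by property (names and structure are illustrative, not the production serving layer); the rule mirrors the one described later: prefer the speed layer only when its update arrived after the batch run began.

```python
from datetime import datetime

def serve_zestimate(property_id, batch_view, speed_view, batch_start_time):
    """Prefer the speed-layer estimate only when its underlying fact update
    arrived after the batch run began; otherwise the batch value wins."""
    speed = speed_view.get(property_id)
    if speed is not None and speed["updated_at"] > batch_start_time:
        return speed["zestimate"]
    return batch_view[property_id]["zestimate"]

batch_view = {1: {"zestimate": 452_000}}
speed_view = {1: {"zestimate": 468_000, "updated_at": datetime(2017, 6, 2, 9, 30)}}
print(serve_zestimate(1, batch_view, speed_view, datetime(2017, 6, 2, 0, 0)))
```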
At this point, we’re going to explore a few of the layers of the Lambda Architecture and see how we implement each layer for the Zestimate itself. To begin, we start with the data. As I mentioned before, most of our data in 2015 was only stored on premises in relational databases. Our first goal, then, was to move this data to the cloud and have new data-generating processes write directly to the cloud store. At Zillow, we use AWS S3 for our data lake / master dataset. It is optimized to handle a large, constantly growing set of data. In our case, we have a bucket specifically designated for raw data. In this design, we never modify or update the raw data, and I’ll explain why in a moment. As such, we set permissions on the bucket itself to prevent data deletes and updates. Any generic data-generating process is responsible for only appending new records to this object store, never deleting.
Most data-generating processes are writing JSON data. We do mandate a schema contract between the producers and consumers of the data, to ensure data types are conformed to.
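A minimal sketch of what an append-only producer could look like with boto3; the bucket name and key layout here are assumptions for illustration, and in practice the bucket policy itself would forbid overwrites and deletes:

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def append_raw_record(record: dict, source: str, bucket: str = "example-raw-data") -> str:
    """Append one immutable JSON record under a unique, timestamped key."""
    now = datetime.now(timezone.utc)
    key = f"{source}/dt={now:%Y-%m-%d}/{now:%H%M%S}-{uuid.uuid4()}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(record).encode("utf-8"))
    return key

# Hypothetical home-fact update; producers only ever add new keys.
append_raw_record(
    {"property_id": 1, "bathrooms": 2.5, "update_time": "2016-06-24"},
    source="home-facts",
)
```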
Data is immutable. Let’s understand what this means by working through this example. We have a sample home and how it has evolved over time. In 2010, it was constructed with 2 bedrooms and 1 bathroom. Five years later, the homeowner added a full bath, increasing the square footage; this was done just a few months before selling the home later in 2015. A new owner purchased the home and, nearly a year later, decided to add another bedroom and a half-bath. With mutable data, this story is lost. One way of storing these attributes in a relational database would be to update records in place with the new attributes.
Data is eternally true. Now let’s introduce the transaction that I referred to. It occurred before the number of bathrooms changed again. In our mutable data view, this transaction would have been tied to a bathroom upgrade from the future. Once we attach a timestamp to data, we ensure it is eternally true. It is eternally true that in 2015, this home had 2 bathrooms, but in 2016, a half bath was added. This story is extremely important for data scientists. And while this example may be trivial, you can imagine tying a sale value to a larger set of home facts that weren’t actually true at that point in time.
Immutability of data allows us to retain this story. We’re no longer updating data, and as a benefit, we are less prone to human mistakes, especially when it comes to what all data scientists hold dear: the raw data itself.
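The deck doesn't show code for this, but one way to express the point-in-time join in PySpark is sketched below, using the bathrooms and sale tables from the slides; the column names are taken from those examples and the window-based "as-of" logic is an assumption about how such a join could be written, not Zillow's actual ETL:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("point-in-time-join").getOrCreate()

facts = spark.createDataFrame(
    [(1, 2.0, "2015-05-15"), (1, 2.5, "2016-06-24")],
    ["property_id", "bathrooms", "update_time"],
)
sales = spark.createDataFrame(
    [(1, 450_000, "2015-08-19")], ["property_id", "sale_value", "sale_time"]
)

# For each sale, keep only fact versions known on or before the sale date,
# then take the most recent one -- never a "bathroom upgrade from the future".
w = Window.partitionBy("property_id", "sale_time").orderBy(F.col("update_time").desc())
training = (
    sales.join(facts, "property_id")
    .where(F.col("update_time") <= F.col("sale_time"))
    .withColumn("rank", F.row_number().over(w))
    .where(F.col("rank") == 1)
    .drop("rank")
)
training.show()
```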
After migrating our data to AWS S3, we began work on the batch layer for the Zestimate pipeline. From a high level, the Zestimate batch layer has a few components: first, we need to make the raw, master dataset available. Apache Spark allows us to read directly from S3, but some of our raw data sources suffer from the painful small-files problem in Hadoop. Simply put, big data systems expect to consume a few large files rather than many small files, and Apache Spark suffers from this same problem. We rely heavily on vacuuming applications, such as Hadoop’s distcp, to aggregate data into larger files, pulling from S3 and storing the aggregates on HDFS.
From there, our jobs read directly from HDFS: we begin with an ETL layer, responsible for producing training and scoring sets for our various models. Then, training and scoring take place for about 100M homes in the nation. Models, training and scoring sets, and performance metrics are all stored in a different S3 bucket, one reserved for transformed data. This ensures that we’re distinguishing between the raw data (our master dataset) and the data derived from it.
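A minimal sketch of that flow in PySpark; the HDFS/S3 paths and the partitioning column are illustrative assumptions, not the actual pipeline configuration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zestimate-etl-sketch").getOrCreate()

# Read raw JSON that a distcp-style vacuuming job has already compacted onto HDFS.
raw = spark.read.json("hdfs:///data/raw/home-facts/")

# ... cleaning and standardization would happen here ...

# Write the derived dataset to a separate "transformed" bucket, partitioned so
# that training and scoring can run on uniform chunks in parallel.
(
    raw.repartition("state")          # hypothetical partition column
    .write.mode("overwrite")
    .partitionBy("state")
    .parquet("s3://example-transformed-data/scoring-set/")
)
```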
The ETL layer is responsible for interfacing with the master dataset and transforming it in order to arrive at cleaner, standardized datasets that are consumable by our Zestimate models. We deal with a wide variety of data sources and so need to pull appropriate features from each to build a rich feature set. We invest a lot of time into ensuring our data is clean. As we know, garbage in, garbage out, and this holds true for the Zestimate algorithm. One example we always talk about is the case of fat-fingers: you can imagine that typing 500 square feet instead of 5000 square feet could drastically change how we perceive that home’s value. This cleaning process, in addition to the partitioning required, can be very expensive computationally. This is one area where a speed layer would need to be more nimble, as it won’t be able to look at historical data to make inferences about the quality of new data.

After the ETL step, we can begin training models. Training, in our case, requires large amounts of memory to support caching of training sets for various models. We train models on various geographies, making tradeoffs between data skew and volume of data available. Scoring is then done in parallel, using data partitioned in uniform chunks. At this point, we have a view created (the Zestimates for about 100M homes in the nation) as well as pre-trained models for the speed layer. But at this point, some of the facts that went into our model training and scoring could be out of date.
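A sketch of what one such longitudinal sanity check could look like; the "5x from the property's historical median" rule and the column names are hypothetical, chosen only to illustrate the fat-finger example above:

```python
from pyspark.sql import functions as F

def flag_fat_fingers(facts_df):
    """Flag square-footage values that differ from the property's historical
    median by more than 5x (e.g. 500 entered instead of 5000)."""
    medians = facts_df.groupBy("property_id").agg(
        F.expr("percentile_approx(square_footage, 0.5)").alias("median_sqft")
    )
    return facts_df.join(medians, "property_id").withColumn(
        "suspect_sqft",
        (F.col("square_footage") > 5 * F.col("median_sqft"))
        | (F.col("square_footage") < F.col("median_sqft") / 5),
    )
```

A check like this needs the full history of a property, which is exactly why the speed layer cannot afford it and must fall back to simpler validation.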
The number one source of Zestimate error is the facts that flow into it, like bedroom count, bathroom count, and square footage.
We provide homeowners with a means for proactively making adjustments to their Zestimate. They can update a bathroom count or square footage and immediately see a change in their Zestimate.
Beyond that, we want to recalculate Zestimates when homes are listed on the market, because in these cases an off-the-market home is updated with all of the latest facts so that it is represented accurately on the market.
In Lambda Architecture, we want our speed layer to read from the same generic data-generating processes that our batch layer does. Amazon Kinesis (Firehose and Streams) makes it easy both to write to S3 and to have consumers read directly from the stream. At this stage, you have the choice of which consumer to use. Spark Streaming can be used directly to enable code sharing (specifically, code relying on the Spark API) between the batch layer and the speed layer, but if Spark-specific code sharing is not a requirement, Amazon’s Kinesis Client Library (which Spark Streaming relies on) is a good solution.
In our case, we built our Kinesis consumer with just the Kinesis Client Library, for three reasons: (1) simplicity, (2) no need for Spark processing, and (3) Elastic MapReduce would be more expensive than a small Elastic Compute Cloud (EC2) instance.
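As a rough illustration of the speed-layer flow (read an update, look up the property's cached features in Redis, ask the Zestimate API to rescore), here is a simplified polling loop using boto3; the production consumer is built on the Kinesis Client Library, which also handles checkpointing and shard rebalancing, and the stream, key, and endpoint names below are illustrative only:

```python
import json
import time

import boto3
import redis
import requests

kinesis = boto3.client("kinesis")
cache = redis.Redis(host="localhost", port=6379)

shard_iterator = kinesis.get_shard_iterator(
    StreamName="home-fact-updates",            # hypothetical stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while shard_iterator:
    batch = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in batch["Records"]:
        update = json.loads(record["Data"])
        # Look up the property's cached features and, if present, ask the
        # Zestimate API to rescore with the latest pre-trained models.
        features = cache.hgetall(f"property:{update['property_id']}")
        if features:
            requests.post("http://zestimate-api.internal/score", json=update)
    shard_iterator = batch["NextShardIterator"]
    time.sleep(1)
```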