This session covers the most recent Big Data & IoT announcements at re:Invent. Learn about trends and use cases for understanding your data and implementing an Internet of Things (IoT) project. Hear about how AWS customers are using AWS IoT to connect their devices to the cloud and solve business challenges with IoT.
3. The diminishing value of data
• Recent data is highly valuable
• Old + Recent data is more valuable
4. Relational Database Service (RDS)
AMAZON AURORA
MySQL and PostgreSQL compatible
Several times faster than EC2/RDS
Highly available and durable
1/10th the cost of commercial-grade databases
re:Invent 2015: Thousands of customers
re:Invent 2016: 3.5X more customers
Today: Tens of thousands of customers
5. Aurora is the fastest growing service in the
history of AWS
6. Why AWS built Amazon Aurora
Speed and availability of high-end commercial databases
Simplicity and cost-effectiveness of open source databases
Drop-in compatibility with MySQL and PostgreSQL
Simple pay as you go pricing
Delivered as a managed service
7. Database architectures in the last 30 years
Even when you scale it out, you’re still replicating the same stack
[Diagram: multiple application stacks, each repeating the same SQL, Transactions, Caching, and Logging layers on top of Storage]
8. A service-oriented architecture applied to the database
• Moved the logging and storage layer into a multitenant, scaled-out, database-optimized storage service
• Integrated with other AWS services like Amazon EC2, Amazon VPC, Amazon DynamoDB, Amazon SWF, and Amazon Route 53 for control plane operations
• Integrated with Amazon S3 for continuous backup with 99.999999999% durability
[Diagram: SQL, Transactions, and Caching remain in the database node; the Logging + Storage data plane is a shared service backed by Amazon S3]
9. Aurora Design
Seamless recovery from read replica failures
Auto-scale new read replicas
Up to 15 read replicas across 3 availability zones
[Diagram: the application reads from Read Replicas 1 and 2 and writes to the master node, all sharing a distributed storage volume spanning Availability Zones 1–3]
10. Aurora Multi-Master (Preview Today)
First relational database service with scale-out of both reads and writes across multiple datacenters
• Zero application downtime from ANY node failure
• Zero application downtime from ANY AZ failure
• Faster write performance
• Multi-region coming in 2018
[Diagram: the application issues reads and writes to Read/Write Masters 1–3 on a shared distributed storage volume across Availability Zones 1–3]
12. Aurora Serverless (Preview Today)
On-demand, auto-scaling database for applications with unpredictable or cyclical workloads
• Automatically scales capacity up and down
• Starts up on demand and shuts down when not in use
• Pay per second, and only for the database capacity you use
• No need to provision instances
13. Aurora Serverless
On-demand, auto-scaling database for applications with variable workloads
• Starts up on demand, shuts down when not in use
• Automatically scales with no instances to manage
• Pay per second for the database capacity you use
[Diagram: the application connects through a warm capacity pool]
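The pay-per-second model on this slide is easy to see with a little arithmetic. Aurora Serverless meters capacity units over time; the unit price and the workload below are hypothetical numbers, not real pricing:

```python
# Back-of-the-envelope for "pay per second, only for capacity used".
def serverless_cost(usage, price_per_unit_second):
    """usage: list of (seconds, capacity_units) segments while the DB is active.
    Paused periods contribute nothing to compute cost."""
    return sum(seconds * units for seconds, units in usage) * price_per_unit_second

# A bursty day: 2 hours at 4 units, a 10-minute spike at 16 units, idle otherwise.
usage = [(2 * 3600, 4), (10 * 60, 16)]
cost = serverless_cost(usage, price_per_unit_second=0.06 / 3600)  # $0.06/unit-hour, hypothetical
```

With a provisioned instance you would pay for all 24 hours at peak-ish capacity; here the idle hours simply drop out of the sum.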
15. Relational vs. non-relational databases
Traditional SQL: primary/secondary databases that scale up
NoSQL: a distributed cluster of database nodes that scales out
[Diagram: one primary/secondary pair growing bigger vs. many small DB nodes added horizontally]
16. SQL vs. NoSQL schema design
NoSQL design optimizes for
compute instead of storage
17. High availability and durability
WRITES: replicated continuously to 3 Availability Zones; persisted to disk (custom SSD)
READS: strongly or eventually consistent; no latency trade-off
• Designed to support 99.99% availability
• Built for high durability
19. Prime Day 2017 Metrics
Block Storage – Use of Amazon Elastic Block Store (EBS) grew by 40% year-over-year, with
aggregate data transfer jumping to 52 petabytes (a 50% increase) for the day and total I/O requests
rising to 835 million (a 30% increase).
NoSQL Database – Amazon DynamoDB requests from Alexa, the Amazon.com sites, and the
Amazon fulfillment centers totaled 3.34 trillion, peaking at 12.9 million per second.
Stack Creation – Nearly 31,000 AWS CloudFormation stacks were created for Prime Day in order to bring additional AWS resources online.
API Usage – AWS CloudTrail processed over 50 billion events and tracked more than 419 billion
calls to various AWS APIs, all in support of Prime Day.
Configuration Tracking – AWS Config generated over 14 million Configuration items for AWS
resources.
20. DYNAMODB GLOBAL TABLES (GA)
First fully managed, multi-master, multi-region database
• Build high-performance, globally distributed applications
• Low-latency reads and writes to locally available tables
• Disaster proof with multi-region redundancy
• Easy to set up, and no application re-writes required
29. DynamoDB Backup and Restore
First NoSQL database to automate on-demand and continuous backups
• On-demand backups for long-term data archival and compliance (GA)
• Point-in-time restore for short-term retention and data corruption protection (Coming soon)
• Back up hundreds of TB instantaneously with NO performance impact
36. CHALLENGES BUILDING APPS WITH HIGHLY CONNECTED DATA
Relational databases:
• Unnatural for querying graphs
• Inefficient graph processing
• Rigid schema, inflexible for changing graphs
Existing graph databases:
• Difficult to maintain high availability
• Difficult to scale
• Limited support for open standards
• Too expensive
37. Amazon Neptune (Available in preview today)
Fully managed graph database
• FAST AND SCALABLE: store billions of relationships and query with millisecond latency
• EASY: build powerful queries easily with Gremlin and SPARQL
• RELIABLE: 6 replicas of your data across 3 AZs, with full backup and restore
• OPEN: supports Apache TinkerPop™ and W3C RDF graph models
39. Data Lake
What Big Data workloads look like:
Collect → Store → Analyze → Visualize
40. Data Lake on AWS
An S3 DATA LAKE, surrounded by:
• Amazon Redshift + Redshift Spectrum
• Amazon QuickSight
• Amazon EMR (Hadoop, Spark, Presto, Pig, Hive… 19 total)
• Amazon Athena
• Amazon Kinesis
• Amazon Elasticsearch Service
• AWS Glue
41. Amazon S3: Data Lake on AWS
• Most ways to bring data in
• Best security, compliance, and audit capabilities
• Object-level controls
• Unmatched durability, availability, and scalability
• Twice as many partner integrations
• Business insights into your data
43. The diminishing value of data
• Recent data is highly valuable
• Old + Recent data is more valuable
44. S3 SELECT (Preview Today)
Powerful new S3 capability to pull out only the object data you need using standard SQL expressions
• New API to select and retrieve data within objects
• Accelerate any application that processes a subset of object data in S3
• Improve data access performance by up to 400%
COMPLEX PRESTO QUERY against a standard TPC-DS dataset (6 sub-queries, each containing 1 table, 3 aggregations, and 4 filters):
• Without S3 Select: 8 seconds
• With S3 Select: 1.8 seconds (4.5x faster)
45. Glacier SELECT (GA)
Run queries directly on data stored in Amazon Glacier
• Run queries on data stored at rest in Amazon Glacier
• Any application can query Glacier data
• Retrieve only what you need
• Makes Glacier part of your data lake
49. Challenges Of Devices Living On The Edge
• Round-trip latency
• Intermittent connectivity
• Expensive bandwidth
• Programming and updating embedded software needs specialized skills
• Limited to what is on the device unless you rewrite or reprogram the device
52. AWS Greengrass ML Inference (Preview Today)
Run Machine Learning at the edge
• Build and train models in the cloud
• Use the AWS Greengrass console to transfer models to your devices
• Inference on the device
• Devices take action quickly – even when disconnected
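The "act quickly, even when disconnected" behavior on this slide boils down to inferring locally and syncing opportunistically. A minimal sketch with stand-in model and cloud calls (this is not the Greengrass API, just the pattern):

```python
from collections import deque

class EdgeInference:
    """Toy edge loop: always infer on-device; push results to the cloud
    when connectivity allows, otherwise buffer them locally."""

    def __init__(self, model, cloud_send):
        self.model = model          # stand-in for a trained ML model
        self.cloud_send = cloud_send
        self.backlog = deque()      # results waiting for connectivity

    def handle(self, sample, connected):
        result = self.model(sample)     # inference happens locally, always
        self.backlog.append(result)
        if connected:
            while self.backlog:         # flush everything buffered so far
                self.cloud_send(self.backlog.popleft())
        return result

sent = []
device = EdgeInference(model=lambda x: x > 0.5, cloud_send=sent.append)
device.handle(0.9, connected=False)   # acts on the result, sends nothing
device.handle(0.2, connected=True)    # reconnects and flushes both results
```

The device never blocks on the round-trip latency or intermittent connectivity listed on slide 49; the cloud is only on the reporting path.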
Customers have also found tremendous value in being able to mine this data to make better medicine, tailor purchasing recommendations, detect fraudulent financial transactions in real time, provide on-demand digital content such as movies and songs, and forecast the weather; the list goes on and on.
The core job of analytics is to help companies gain insight into their customers. Then, the companies can optimize their marketing and deliver a better product.
Data driven -> Netflix use case.
So how does Netflix use analytics?
“There are 33 million different versions of Netflix.”
– Joris Evers, Director of Global Communications
Netflix Uses Analytics To Select Movies, Create Content, and Make Multimillion Dollar Decisions
Narrative: So how much is this data worth? Well, it depends…
Recent data is highly valuable
If you act on it in time
Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
If you have the means to combine them
Narrative: Processing real-time data as it arrives can let you make decisions much faster and get the most value from your data. But building your own custom applications to process streaming data is complicated and resource intensive. You need to train or hire developers with the right skillsets, then wait months for the applications to be built and fine-tuned, and then operate and scale the application as the business grows.
All of this takes lots of time and money, and, at the end of the day, lots of companies just never get there, settle for the status-quo, and live with information that is hours or days old.
This is why enterprises have been moving, as fast as they can, as many of their databases to open source database engines like MySQL, MariaDB, and Postgres. However, to try and get the same performance from those open source engines that you get in the commercial grade databases is hard. It's possible. We've done a lot of it at Amazon. But it takes a lot of tuning.
Customers want to move from proprietary databases over to open source and want a fully managed environment with automated provisioning, configuration, tuning, patching and backups, all with lower cost and simple pay-as-you-go pricing.
We also heard from customers that in addition to the familiarity of open source databases, and the time saving benefits of managed database services in the cloud, they want the enterprise-grade performance and reliability that old-world databases offer, which is why we built Amazon Aurora.
1/ Grown by 2.5X again…tens of thousands of customers
2/ FINRA
3/ Expedia
4/ Verizon
5/ CBS Interactive
6/ Dow Jones
7/ Hulu
8/ TRANSITION: There are a lot of things people love about Aurora…
Before we started working on Aurora earlier this decade…
…because if you look at database architectures in last 30 years..
We radically changed this architecture with Amazon Aurora. We delivered a MySQL 5.6 compatible engine where we used distributed infrastructure of AWS to create a purpose built logging and storage system that sits completely outside the database box.
1/ Customers love the high performance and high availability they get from Aurora… and the scale-out architecture we’ve built is a big part of this.
2/ Customers can scale-OUT database read capacity by seamlessly adding up to 15 copies of your data through read replicas. This allows customers to scale to millions of database read statements per second.
3/ We also recently added the ability for Aurora to automatically add new read replicas (up to 15) as your application load on the database grows.
4/ This architecture also gives customers high levels of availability as Aurora will immediately route reads to an alternate replica if one of the read replicas fail.
5/ Customers love this architecture, but they’ve been asking us for even more scalability and availability…. In particular, they are asking us to scale out and provide seamless recovery not just for database reads, but also for database writes and to do this across multiple datacenters and multiple regions.
6/ Today Aurora databases run with a single master instance which processes all database write requests. If the master fails, Aurora will promote a read replica to become the new master in under 30 seconds. While this is considerably less than other databases for recovery of a master node, we asked ourselves if we could provide the same seamless recovery and scale-out for writes as we do for reads.
7/ And I am very excited to announce…
1/ As we’ve been discussing, customers love the performance, open source capability, and price of Aurora
2/ But, they have some workloads that don’t require databases to run often
3/ Could be bursty workloads like development and test, flash sales or blogs
4/ Could be unpredictable workloads like weather disaster sites
5/ Could be workloads that just get action at a couple times a day
6/ Yet, these customers have no option in the relational database market anywhere that doesn’t force them to buy the software and hardware and pay for it full time
7/ Customers have said, "Hey, I know this is hard, but can you fix it?"
8/ Introducing….
...Aurora Serverless, an on-demand version of Aurora.
1/ Aurora Serverless has virtually all the same benefits as Aurora but…
2/ Doesn’t require you to provision instances
3/ Automatically scales capacity up and down when needed
4/ Starts up on demand and shuts down when not in use
5/ Pay per second and only for the database capacity you consume
1/ Another type of non-relational database that has gained popularity is in-memory data stores, which are commonly used as distributed data caches.
2/ Distributed caches are often used with relational and non-relational databases to speed-up data access.
3/ This is useful for when the application needs microsecond latency when millisecond latency is just not fast enough.
4/ Distributed caches also allow you to deliver massive throughput by keeping frequently stored information in memory, and reducing the load on your existing database.
5/ And, that is why we built ElastiCache.
6/ ElastiCache offers managed in-memory stores for both Redis and Memcached
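The speed-up described in these notes usually comes from the cache-aside pattern: check the cache first, fall back to the database, then populate the cache. A minimal sketch where an in-process dict stands in for Redis/Memcached, with an illustrative TTL and loader:

```python
import time

class CacheAside:
    """Minimal cache-aside: serve from cache when fresh, otherwise load
    from the backing store and cache the result with a TTL."""

    def __init__(self, loader, ttl=60.0):
        self.loader = loader    # e.g. a function that runs a SQL query
        self.ttl = ttl
        self.store = {}         # stand-in for Redis/Memcached

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0], "cache"
        value = self.loader(key)                              # slow path
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value, "db"

db_calls = []
def slow_db(key):
    db_calls.append(key)        # track how often the database is actually hit
    return key.upper()

cache = CacheAside(slow_db)
cache.get("user:1")   # miss: goes to the database
cache.get("user:1")   # hit: served from memory
```

The second lookup never touches the database, which is exactly the load reduction and latency win the notes describe; a real deployment would swap the dict for a Redis or Memcached client.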
In addition to relational, key-value and in-memory stores, customers today are also building applications that rely on storing and navigating highly connected data.
Relational - Data is normalized. To enable joins, you are tied to a single partition and a single system, and performance depends on the hardware specs of the primary server. To improve performance, you optimize: move to a bigger box. You may still run out of headroom. Create read replicas. You will still run out. Scale UP.
NoSQL -- NoSQL databases were designed specifically to overcome scalability issues. Scale “out” data using distributed clusters, low-cost hardware, throughput + low latency
Therefore, using NoSQL, businesses can scale virtually without limit.
Generic product catalog. Table relationships are normalized.
A product could be a book – say the Harry Potter Series. There’s a 1:1 relationship. Or it could be a movie..
You can imagine the types of queries you'd have to execute: 1. Show me all the movies starring a given actor. 2. Show me the entire product catalog. This is resource intensive – it performs complex joins.
** With NoSQL you have to ask: how will the application access the data?
Optimize for the costlier asset. No joins, just a select. Hierarchical structures, designed with access patterns in mind.
Via duplication of data (storage), optimized for compute, it is fast.
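The duplication-for-compute idea above can be sketched as a single-table, access-pattern-driven layout: items are duplicated so every query is a single key lookup, never a join. The keys and attributes below are illustrative, not a real catalog schema:

```python
# One "table" keyed by partition key; related items live under the same key
# and carry a sort key, in the style of a DynamoDB single-table design.
catalog = {
    "PRODUCT#1": [
        {"sk": "META", "title": "Harry Potter", "type": "book"},
        {"sk": "AUTHOR#rowling", "name": "J.K. Rowling"},
    ],
    "PRODUCT#2": [
        {"sk": "META", "title": "Some Movie", "type": "movie"},
        {"sk": "ACTOR#smith", "name": "A. Smith"},  # duplicated per product
    ],
}

def query(pk, sk_prefix=""):
    """Single-partition read, like a Query with a begins_with(sk) condition."""
    return [item for item in catalog.get(pk, []) if item["sk"].startswith(sk_prefix)]

whole_product = query("PRODUCT#1")            # entire product, one lookup, no join
authors = query("PRODUCT#1", "AUTHOR#")       # just the author facet
```

In the normalized relational version, the same two reads would be multi-table joins; here the storage cost of duplicating actor/author rows buys constant-shape, single-key reads.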
13/35. 4 more regions. DynamoDB is highly durable. AWS has a concept of regions and Availability Zones. An AWS region is a geographic area. Each region has multiple Availability Zones. Each AZ has 1 or more physical DCs. They have redundant power and cooling, and are interconnected via high-speed, low-latency fiber. Take for example the AWS region in N. Virginia. It has 4 AZs.
When you create a DynamoDB table in N. Virginia, we will replicate the data to 3 AZs. All the data is stored on SSDs.
A lot of value built into DynamoDB– a few clicks.
1/ Can see with how many companies are using DynamoDB today…Snap, Lyft, Tinder, Redfin, Comcast, Under Armour, BMW, and Toyota
2/ As an increasing number of customers build geographically distributed applications with high performance and availability needs, these applications require the data in DynamoDB tables to be replicated, and locally available in multiple regions.
3/ Today, DynamoDB already provides intra-region data replication across three availability zones.
4/ In addition, customers can replicate their DynamoDB table data across multiple regions with an open source command line tool.
5/ However, this approach can be time consuming and complex to manage, and many of our customers have asked us whether we can automate this process for them.
6/ So today, we are pleased to announce…
1/ With Global Tables, DynamoDB is the first fully managed, multi-master, multi-region database.
2/ Since your data is now replicated across multiple regions, your globally distributed applications benefit from low latency reads and writes from the locally available tables.
3/ This is especially important for customers who have users all over the world and cannot afford to have any delay or lag in their application experience. For example, an Expedia customer using their mobile application in North America, should have the same responsive user experience when receiving personalized recommendations or updating their user profile when they travel to Europe or Asia. The application should be able to read from and write to a locally available database closest to the user no matter where the user is travelling in the world.
4/ Global Tables also ensures data redundancy across multiple regions and allows the database to stay available even in the event of a complete regional outage.
5/ Global Tables is easy to setup with a few clicks. Customers simply select the regions where data should be replicated, and DynamoDB handles the rest. This frees developers to focus on building their applications and business rather than worry about database administration tasks.
6/ DynamoDB Global Tables is Generally Available today, we hope you will check it out.
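Multi-master replication of this kind has to resolve concurrent writes to the same item from different regions; Global Tables uses a last-writer-wins approach. A toy merge makes that concrete, with timestamps simplified to plain numbers:

```python
def merge(local, remote):
    """Last-writer-wins reconciliation of the same item written in two
    regions, in the style of Global Tables-like multi-master replication.
    Ties go to the local copy; real systems break ties deterministically."""
    return local if local["ts"] >= remote["ts"] else remote

# The same user profile updated concurrently in two regions:
us_east = {"pk": "user#1", "city": "NYC", "ts": 100}
eu_west = {"pk": "user#1", "city": "Paris", "ts": 105}

winner = merge(us_east, eu_west)   # the newer (EU) write survives everywhere
```

Every replica applies the same rule, so all regions converge on the same item; the price is that the older concurrent write is silently discarded, which applications need to design around.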
1/ To meet this need, today, we are adding On-Demand Backups and Point In Time Restore to DynamoDB
2/ With these new capabilities, DynamoDB is the first NoSQL database to automate on demand and continuous backups.
3/ On-Demand Backups allow customers to create FULL backups of their data instantaneously for long term data archival and to comply with corporate and governmental regulatory requirements.
4/ Point In Time Restore (PITR) allows customers to restore their data up to the minute for the past 35 days. When enabled, DynamoDB will continuously backup all customer data to protect from short-term loss due to application errors.
5/ Customers with data volumes that are 100s of TBs large, serving single digit millisecond latency workloads, can now back up their data instantaneously with NO performance impact to their production applications.
6/ No other cloud database provides this capability today.
7/ On Demand Backup is Generally Available today and Point in Time Restore is coming in early 2018…
1/ A lot of applications being built today need to understand and navigate relationships between highly connected data to enable use cases like social applications, recommendation engines, fraud detection, etc
2/ If you are building a restaurant recommendation app, you want the app to provide recommendations of restaurants of a certain cuisine like Sushi in a city like New York that at least two of the users friends also like.
3/ In all of these use cases, because the data is highly connected, it is easily represented as a graph shown here.
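The restaurant example above is naturally a graph traversal. In Neptune it would be written in Gremlin or SPARQL; here is the same idea ("restaurants at least two of my friends like") over a toy adjacency list in plain Python, with made-up names:

```python
# Toy social graph: friendship edges and "likes" edges.
friends = {"me": ["ana", "bo", "cy"]}
likes = {
    "ana": ["sushi_ko"],
    "bo": ["sushi_ko", "taco_town"],
    "cy": ["taco_town"],
}

def recommend(user, min_friends=2):
    """Restaurants liked by at least `min_friends` of the user's friends:
    a two-hop traversal (user -> friends -> liked restaurants) with a count."""
    counts = {}
    for friend in friends.get(user, []):
        for restaurant in likes.get(friend, []):
            counts[restaurant] = counts.get(restaurant, 0) + 1
    return sorted(r for r, c in counts.items() if c >= min_friends)

recs = recommend("me")
```

The same query in a relational model would join a users table, a friendships table, and a likes table; as a graph it is just two hops and a count, which is the point the next slide makes about purpose-built graph engines.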
4/ TRANSITION: You can perform this song with a trumpet even though it calls for a saxophone, but a trumpet isn’t a saxophone…
1/ Today, customers build application on highly connected data with either their existing relational databases or with purpose-built graph databases. Both approaches today are sub-optimal.
2/ If you were to try and represent this data in a relational model, you would end up with multiple tables with multiple foreign keys and your queries would quickly become unwieldy involving nested queries and complex joins that won’t perform as your data size grows over time.
3/ Today’s graph database options are typically open source or commercially licensed.
4/ Open source editions are hard to scale and lack enterprise capabilities such as high availability and management.
5/ Commercial options are expensive and force customers to choose between either the Property Graph (e.g. Apache TinkerPop™) and RDF graph models regardless of their application needs. Support for Open APIs in such solutions tend to be bolt-ons and users often need to use proprietary APIs for best performance.
6/ What customers really want is a graph database service that is compatible with leading graph models, features open APIs, and is also fully managed, fast, scalable and cost effective…
7/ Introducing…
…Amazon Neptune, a fast, reliable, fully-managed graph database that makes it easy to build and run applications that need to work with highly connected datasets.
1/ Neptune gives developers flexibility by supporting both Tinkerpop and RDF graph models
2a/ It’s really fast and scalable – can create sophisticated, interactive graph applications storing billions of relationships and querying the graph with milliseconds latency
2b/ Neptune’s core is a purpose built high performance database engine optimized for graphs
2c/ Enables 15 low latency read replicas allowing 100K queries/second
3/ Very reliable – Four 9s of availability, fault tolerant and self healing storage built for cloud that replicates your data across 3 AZs and continuously backs up your data to S3
4/ Easy – Query processing engine optimized for both Gremlin and SPARQL (SPARKLE)
5/ Available today in preview
1/ So you can see in the new world of cloud born applications, a one-size-fits-all database model no longer works.
2/ All modern organizations will use multiple DB types, some multiple in same app
3/ At AWS, our goal is to provide you with the right tool for the job, and nobody has the breadth of DB capabilities available for you that AWS does
A foundation of highly durable data storage and streaming of any type of data
A metadata index and workflow which helps us categorise and govern data stored in the data lake
A search index and workflow which enables data discovery
A robust set of security controls – governance through technology, not policy
An API and user interface that expose these features to internal and external users
1/ For unstructured ad hoc queries on things like logs, raw event files, and click-stream data, Athena is a great solution
2/ For processing vast amounts of unstructured data across dynamically scalable clusters using popular distributed frameworks like Spark, Hadoop, Presto, Pig, Hive, Yarn (16 in all), EMR is a great solution (we have)
3/ For complex queries on large collections of unstructured data with super-fast performance, you can use Redshift as your data warehouse solution…and if you want to extend your queries beyond the optimized, local Redshift cluster, you can use Redshift Spectrum to extend these Redshift queries to run directly on your S3 data
4/ For customers wanting to run real-time operational intelligence and document search analysis, can run our managed Elasticsearch service
5/ For real-time processing of streaming data, customers are using Kinesis (especially for streaming data from edge connected devices to and from the cloud)
6/ For business intelligence and visualization, QuickSight
7/ And, to do ETL (extract, transform, and load) as well as move data around in the cloud and across data stores and analytics services, AWS has Glue…this is an unmatched set of analytics services that give customers the right tool for the right analytics need
The thing is that with a lot of the querying they don’t need all the data in the objects.
Today the way they do it-- whether you are querying in place or using your own applications,-- the analytics application has to take all the data out before it can process it, which adds cost and impacts performance.
People really just want to pinpoint that exact data they want to query and pull that out instead of the whole object.
For example, let’s say you’re running analytics on web site log file data related to iOS 10 users, and those users only represent 10% of the log file data you store in S3 objects. In this case, your analytics application really only needs to process a small subset of the data within a bunch of your S3 objects.
The analytics application needs to do too much hard work here.
1/ It pulls all the relevant objects out of S3.
2/ It finds and extracts the iOS 10 data from inside the S3 objects used to store web site log file data.
3/ Then it can finally process the data.
All of this means you are moving around and processing a lot of data that isn’t relevant to the query you want to run, so it slows things down and cost more than it should.
So we decided to give you a new way for your applications to run these queries.
With much more operating experience and scale, and a much broader set of features and capability than available anywhere else, S3 is the clear number one choice for a data lake. There are a few reasons why Amazon S3 is the world’s most popular cloud storage platform for data lakes.
1/ S3 has unmatched durability, availability, and scalability.
Only S3 replicates your data in three availability zones within a single region. This gives you unmatched resilience to single data center issues like power failures.
Only S3 lets customers do cross-region replication seamlessly without having to use a separate storage class.
Only S3 lets you choose which regions you want to replicate into (and as many as you want to replicate to).
2/ S3 has the best security, compliance, and audit capability.
Only S3 lets you replicate data from one region to another with cross-region replication using company-specific keys stored by Amazon’s key management service for the encryption between regions.
S3 Cross-region replication also lets you use separate accounts for the source and destination regions, protecting against malicious insider deletions of backup data.
S3 also has the most depth in security and compliance controls, offering capabilities you just can’t get in other options like the ability to audit how, when, and who is accessing individual objects in S3 through CloudTrail Data Events.
Amazon Macie, an AI-powered security service, automatically monitors CloudTrail S3 audit trails, detecting and alarming on anomalies that might indicate early stages of an attack like an outsider trying to enumerate role privileges for your storage.
S3 also offers a daily inventory report listing all the objects in a bucket, including important details like encryption status, for security report-outs on storage status.
3/ S3 offers object-level control at any scale.
Other providers force you to set policies broadly across all objects in a bucket, which customers find frustrating and too coarse for enterprise needs.
With S3, lifecycle policies can automatically delete or tier groups of objects that share a common tag or a prefix like a department code.
Only S3 gives you the option of setting multiple editable tags on an individual object, letting you label an object by project ID, compliance requirements, or other business taxonomy. This lets you set up lifecycle policies to delete or tier storage based on tag, or use the tags to restrict access to the object.
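A tag-scoped lifecycle rule of the kind described here looks roughly like the following S3 lifecycle configuration; the tag key/value, rule ID, and day counts are examples:

```json
{
  "Rules": [
    {
      "ID": "tier-finance-objects",
      "Status": "Enabled",
      "Filter": { "Tag": { "Key": "department", "Value": "finance" } },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

Every object carrying that tag moves to infrequent-access storage after 30 days and to Glacier after a year, regardless of which prefix it lives under; prefix-based filters work the same way for department-code paths.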
4/ S3 gives you business insight into your data.
Only S3 has Storage Class Analysis that analyzes storage request patterns and provides recommendations for setting up tiering to lower cost storage classes.
Export data that Storage Class Analysis uses for recommendations to a CSV file and use your favorite BI tool, like Quicksight, to generate custom reports like heat maps for groups of objects. Amazon Macie, which I mentioned earlier, automatically classifies your storage by content type so you understand how your storage changes over time.
5/ One more thing that is really important with a data lake is making it easy to ingest data, and AWS offers more ways to bring data into S3 than anyone else – by far.
AWS Snowball, which are physical devices that let you move petabytes of data into S3,
AWS Snowmobile for exabytes
Direct Connect, which is like your own private data pipe, and
S3 Transfer Acceleration, a unique way to make the Internet go up to 500% faster.
6/ As I said before, we have more than twice as many integrations with storage partners compared to any other cloud platform.
This means that it’s easy to use S3 with what you already have from folks like NetApp, EMC, Veritas and Cloudera,
Use cases like Primary Storage, Backup and Restore, Archive, Disaster Recovery and Analytics.
In addition to integration with most AWS services, the Amazon S3 ecosystem includes tens of thousands of consulting, systems integrator, and ISV partners,
AWS Marketplace offers 35 categories and more than 3,500 software listings from over 1,100 Independent Software Vendors that are pre-configured to deploy on the AWS Cloud.
No other cloud provider has more partners with solutions that are pre-integrated to work with Amazon S3.
1/ Pinterest
2/ Philips
3/ #m
4/ NTT Docomo
5/ GE
6/ TRANSITION: One of our analytics customers is Goldman Sachs…people sometime don’t realize how sophisticated and technically strong Fin Services cos are, but to share how they’re using AWS, it’s my pleasure to introduce Managing Director, Roy Joseph from Goldman Sachs
…S3 Select, a powerful new Amazon S3 capability to pull out only the object data you need using standard SQL expressions <PAUSE FOR CLAPPING>
1/ S3 Select dramatically improves the performance and reduces the cost of applications that need to query data in S3.
2/ Applications only retrieve a subset of data from an S3 object instead of retrieving the entire object. You filter the data using standard SQL expressions like SELECT, FROM, or WHERE.
3/ S3 Select can improve the performance of most applications that frequently access data from S3 by up to 400%.
4/ Example: suppose a retailer needs to analyze the weekly sales data from just one store, but the data for all 200 stores is saved in a new object every day. Without S3 Select, the retailer’s analytics application would have to retrieve the complete set of S3 objects and then filter out just the required store data before being able to perform the analysis. With S3 Select, you can offload the heavy lifting of filtering data inside objects to the Amazon S3 service.
5/ Now, your analytics application just calls the S3 Select API to retrieve only the data from the one store you are interested in.
Your analytics application then processes just the data for that store, greatly increasing performance and reducing the processing cost for your application.
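To make the retailer example concrete, here is a minimal sketch of the filtering S3 Select performs server-side. The store IDs, columns, and sample data are hypothetical; the real call from Python is boto3’s `select_object_content` with a SQL expression, but the sketch reproduces the filtering logic locally so it is self-contained.

```python
import csv
import io

# Hypothetical daily sales object: one CSV row per store, all 200 stores
# stored together in a single S3 object (only 3 rows shown here).
DAILY_SALES_CSV = """store_id,region,weekly_sales
store_001,us-west,120000
store_002,us-east,98000
store_117,eu-west,143000
"""

def s3_select(object_body, store_id):
    """Local stand-in for the filtering S3 Select does inside the service.
    The real API is boto3's s3.select_object_content, with an Expression
    such as: SELECT * FROM S3Object s WHERE s.store_id = 'store_117'."""
    reader = csv.DictReader(io.StringIO(object_body))
    return [row for row in reader if row["store_id"] == store_id]

# The application receives only the rows it asked for, instead of
# downloading and scanning the entire object client-side.
rows = s3_select(DAILY_SALES_CSV, "store_117")
print(rows)  # one row, for store_117 only
```

The point of the design is where the filter runs: the WHERE clause is evaluated inside S3, so only the matching bytes cross the network.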
1/ With S3 Select the same query time was reduced to 1.8 seconds.
2/ This reduced the query response time by 78%.
3/ That’s 4.5X faster performance.
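As a sanity check, the 78% reduction and the 4.5X figure describe the same improvement, since a reduction in response time translates directly into a speedup factor:

```python
# new_time = old_time * (1 - reduction), so
# speedup  = old_time / new_time = 1 / (1 - reduction)
reduction = 0.78
speedup = 1 / (1 - reduction)
print(round(speedup, 1))  # 4.5
```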
4/ S3 Select can be used to dramatically accelerate any application that queries data in Amazon S3.
5/ No other cloud object storage service can even come close to this kind of performance.
6/ Almost all AWS customers use S3 in some way, and so we are really excited about this new feature that makes the world’s best cloud storage even better – more cost effective, able to encompass more data, and optimized for analytics.
Customers have also found tremendous value in being able to mine this data to
develop better medicines,
deliver tailored purchasing recommendations,
detect fraudulent financial transactions in real time,
provide on-demand digital content such as movies and songs,
improve weather forecasts,
and the list goes on and on.
The core job of analytics is to help companies gain insight into their customers. Then, the companies can optimize their marketing and deliver a better product.
Data driven -> Netflix use case.
So how does Netflix use analytics?
“There are 33 million different versions of Netflix.”
– Joris Evers, Director of Global Communications
Netflix Uses Analytics To Select Movies, Create Content, and Make Multimillion Dollar Decisions
Today there are billions of devices everywhere. They are in homes, factories, oil wells, agricultural fields, hospitals, cars, machinery, and thousands of other places. With the proliferation of these IoT devices, enterprises are increasingly having to manage infrastructure that is not located in a data center. In fact, when companies think about their on-premises footprint in 10 years, servers will have moved to the cloud, and connected devices will be on-premises - literally everywhere.
The number of devices out there has been exploding because companies are finding that the closer to the source they can collect and respond to data, the better decisions they can make.
For example, rainfall and weather sensors can make irrigation more efficient and save water. Light sensors can help make lighting more efficient. Sensors on jet engines help with operational efficiencies. And transportation sensors can help with route management to optimize travel time and avoid accidents or weather-related issues.
But the thing is, when you look at these devices, they tend to be relatively limited in their capabilities. They have a very small amount of CPU and a very small amount of disk. This is why the cloud is disproportionately important to these IoT devices.
Illumina
John Deere
BamTech/Statcast
Enel/Engie
BMW (collects sensor data from cars to give dynamically updated map info)
Under Armour Connected Fitness Platform (used by 180M customers WW)
I’m excited to announce Greengrass ML Inference, a new feature of Greengrass that brings machine learning to the edge. <PAUSE FOR CLAPPING>
Today, the most common way ML gets done is that first you go through the compute-intensive process of building and training the model. Then you can use the model to recognize patterns in new data and do inference for your applications. Usually, all of this is done in the cloud.
But with devices at the edge, there is a big advantage if you can do inference right on the device itself, so you can avoid the time and cost of sending the device data up to the cloud to get your prediction and then waiting for the result to come back down to the device to take action.
Greengrass ML Inference is different. With ML at the edge, you build and train your models in the cloud, using a service like Amazon SageMaker or whatever you want. Then you use the Greengrass console to transfer the models down to your devices so inference can be done right on the device itself. This lets your devices make smart decisions quickly even when they are disconnected.
With Greengrass ML Inference, application developers can add machine learning to their devices without having any special machine learning skills.
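A minimal sketch of what on-device inference looks like, assuming the model has already been trained in the cloud and deployed to the device as a local file. The file path, model format, and threshold logic here are hypothetical illustrations, not the Greengrass SDK API:

```python
import json
import tempfile

def load_model(path):
    # In a real deployment, Greengrass places the trained model file on
    # the device; here we simply read it from local disk.
    with open(path) as f:
        return json.load(f)

def infer(model, sensor_reading):
    # Inference happens locally: no round trip to the cloud, so the
    # device can act on its data even while disconnected.
    return "alert" if sensor_reading > model["threshold"] else "ok"

# Simulate the model file a cloud training job would produce and
# Greengrass would push down to the device (threshold is made up).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"threshold": 80.0}, f)
    model_path = f.name

model = load_model(model_path)
print(infer(model, 95.0))  # alert
print(infer(model, 42.0))  # ok
```

The split mirrors the talk’s point: training stays in the cloud where the compute is, while the lightweight inference step runs on the constrained device.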
This is Greengrass ML Inference, and it changes what is possible for IoT, machine learning, and the edge. It’s super cool, and we think people will be excited to try it.