SRV318: Research at PNNL: Powered by AWS — AWS re:Invent 2017 Serverless Breakout Session, Giardinelli, Tue 11/28/2017 1:00 PM
Pacific Northwest National Laboratory's rich data sciences capability has produced novel solutions in numerous research areas, including image analysis, statistical modeling, and social media (and many more!). See how PNNL software engineers utilize AWS to enable better collaboration between researchers and engineers, and to power the data processing systems required to facilitate this work, with a focus on Lambda, EC2, S3, Apache NiFi, and other technologies. Several approaches are covered, including lessons learned.
4. PNNL at a glance
• $920.4M in R&D expenditures
• 104 U.S. and foreign patents granted
• 1,058 peer-reviewed publications
• 2 FLC Awards
• 5 R&D 100 Awards
• 4,400 scientists, engineers, and non-technical staff
5. Software engineering at PNNL
• Staff focus is research and innovation, not operations
• Developers work with scientists to enable research
• Limited space and resources for hardware
  • A big driver for moving to AWS!
• Agile is difficult
6. Problem: isolated research
• Who are the researchers?
• Researchers work independently
• Focus is on innovation and novel concepts
• Lack of collaboration with engineers
• Creates long delivery times
• The product usually isn't what the customer envisioned
7. Enabling research with AWS
• Research is the lifeblood of the organization
• Researchers should not be troubled with environment configurations, optimizations, etc.
• Software engineers provide the expertise needed to build applied solutions
• Utilizing AWS has been a turning point
  • AWS has dramatically helped to improve collaboration
  • AWS fits better with our Agile software processes
As a result, researchers can focus on the problem
8. Moving to the cloud
Our progression to AWS
Drivers
• Lack of internal resources (hardware and people)
• Customer deliverables, demands, and deadlines
Concerns
• Cost
• Vendor lock-in
Initial Approach
• Forklift (lift-and-shift) model
  • Missed out on AWS services
  • Still had operational headaches
Current Approach
• Serverless wherever possible
11. Image retrieval and classification: requirements
Research and customer requirements
• Handle static and streaming media
• Scalable, robust, and flexible
• Easily deployed and maintained
• Extensible (add additional models and instantiations)
• Identify optimal ways to collaborate
15. NiFi overview
Where we find benefit and why we use it
• Process and distribute data
• Message/data routing is very flexible and robust
• ETL is painless
• Easy to install, scale, configure, and extend
• Visually see what is going on in your pipelines
• Backpressure and queueing are baked into the flows, which is excellent for systems with brittle endpoints
• Low barrier to entry broadens the user audience
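The backpressure behavior called out above can be illustrated with a plain-Java sketch (this is not NiFi's API): a bounded queue blocks the producer whenever the downstream consumer falls behind, which is conceptually what NiFi's per-connection backpressure thresholds do for a flow feeding a slow or brittle endpoint.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {
    // Pushes `items` flowfile stand-ins through a bounded "connection" of the
    // given capacity and returns how many the slow consumer drained.
    // put() blocks when the queue is full, so the fast producer is
    // automatically throttled to the consumer's pace -- that is backpressure.
    public static int runPipeline(int items, int capacity) throws InterruptedException {
        BlockingQueue<String> connection = new ArrayBlockingQueue<>(capacity);
        final int[] drained = {0};

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    connection.take();               // slow, brittle endpoint
                    TimeUnit.MILLISECONDS.sleep(5);  // simulated processing delay
                    drained[0]++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        for (int i = 0; i < items; i++) {
            connection.put("flowfile-" + i);         // blocks when connection is full
        }
        consumer.join();
        return drained[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("drained " + runPipeline(20, 5) + " items");
    }
}
```

Nothing is lost under pressure; the producer simply waits, which is why the talk highlights this as a good fit for brittle endpoints.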
17. NiFi tuning on AWS
• C4 and M4 EC2 instance types work well
• Scaling: we go vertical first, then horizontal
• Keep normal CPU utilization at 50–60%
• Set provenance to Volatile
• General-purpose SSDs work well
• Follow the NiFi “Configuration Best Practices” in the admin guide
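For reference, the Volatile provenance setting mentioned above is controlled in nifi.properties; a sketch of the relevant lines (the implementation property is from the NiFi admin guide, the buffer value is illustrative and should be tuned to available heap):

```properties
# nifi.properties — keep provenance in memory instead of on disk
nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
# Number of provenance events held in memory (illustrative value)
nifi.provenance.repository.buffer.size=100000
```

The trade-off is that provenance history is lost on restart, which the team accepted in exchange for lower disk I/O.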
24. Lambda code example
public void handleRequest(SNSEvent snsEvent, Context context) throws Exception {
    // Get the JSON payload from the first SNS record
    String message = snsEvent.getRecords().get(0).getSNS().getMessage();
    // Parse the JSON into a request object, then download the image from its URL
    BufferedImage image = ImageIO.read(imageUrl);
    // Convert the image to JPEG
    ImageIO.write(image, "jpeg", byteArrayOutputStream);
    // Save the converted image to S3
    s3Client.putObject(new PutObjectRequest(bucketName, fileName, inputStream, metadata));
    // Write metadata to DynamoDB
    Table table = dynamoDB.getTable(dynamoDbTable);
    Item item = new Item()
        .withString("url_hash", request.getHash())
        .withString("url", request.getUrl())
        .withString("s3_bucket", s3Bucket);
    table.putItem(item);
    // Create and send a notification to the downstream SQS queue
    SendMessageRequest sendNewImageCachedMsg = new SendMessageRequest()
        .withQueueUrl(queueUrl)
        .withMessageBody(newImageJson);
    amazonSqs.sendMessage(sendNewImageCachedMsg);
}
25. How research and engineering collaborated on the effort
Collaboration
26. Lessons learned
Where we find benefit and why we use it
• Fantastic for scaling
• An obvious choice
• Very performant once functions are loaded (warm)
• The API is easy to use
• It's just Java
• Used in two key situations:
  • Low-cost development/pilot efforts
  • High volume/throughput
27. Lessons learned (continued)
Where we find benefit and why we use it
• Cold-start performance
  • 30 s (cold) as opposed to 400 ms (warm)
• Legacy code vs. new development
• Limits on JAR sizes
• Message size on Amazon SNS
  • 256 KB limit
• Combine functionality in a single Lambda function!
  • Easier and cheaper to manage
  • Step Functions were too expensive for our use cases
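A common workaround for the SNS size cap noted above (the talk does not show one, so this is a sketch) is a claim-check pattern: measure the serialized payload and, when it would exceed the limit, publish only a pointer to an object stored in S3. The bucket/key names are hypothetical and the actual S3 upload is left as a stubbed comment:

```java
import java.nio.charset.StandardCharsets;

public class ClaimCheck {
    // Published Amazon SNS message size cap
    static final int SNS_LIMIT_BYTES = 256 * 1024;

    // True when the UTF-8 encoded payload would exceed the SNS limit.
    public static boolean needsOffload(String payload) {
        return payload.getBytes(StandardCharsets.UTF_8).length > SNS_LIMIT_BYTES;
    }

    // Returns the message to publish: either the payload itself, or a small
    // JSON pointer to an S3 object holding the real payload.
    public static String toSnsMessage(String payload, String bucket, String key) {
        if (!needsOffload(payload)) {
            return payload;
        }
        // s3Client.putObject(bucket, key, payload);  // upload the oversized payload here
        return "{\"s3_bucket\":\"" + bucket + "\",\"s3_key\":\"" + key + "\"}";
    }
}
```

Subscribers then fetch the body from S3 when they see a pointer instead of a payload.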
31. Amazon Athena
Where else can we use serverless?
• Great for exploring data in Amazon S3
• HQL/SQL support
• Partition support
• Use AWS Glue crawlers
• Complements our Hadoop cluster
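To illustrate the partition support mentioned above, an Athena table over S3 data is typically partitioned by an ingest date so queries can prune to a date range. The helper below assembles an example DDL statement; the table, column, SerDe, and bucket names are hypothetical, not from the talk:

```java
public class AthenaDdlExample {
    // Builds an example CREATE EXTERNAL TABLE statement for JSON data in S3,
    // partitioned by ingest date. All names are illustrative.
    public static String imageMetadataDdl(String bucket) {
        return String.join("\n",
            "CREATE EXTERNAL TABLE IF NOT EXISTS image_metadata (",
            "  url_hash string,",
            "  url string,",
            "  s3_bucket string",
            ")",
            "PARTITIONED BY (ingest_date string)",
            "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'",
            "LOCATION 's3://" + bucket + "/enriched/image-metadata/';");
    }

    public static void main(String[] args) {
        System.out.println(imageMetadataDdl("my-research-bucket"));
    }
}
```

An AWS Glue crawler can generate an equivalent table and discover the partitions automatically, which is the route the slide recommends.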
35. In Summary
• More and more, we lean on AWS serverless services
  • We don't have the resources for operations and maintenance
  • The government customers we support prefer serverless solutions
• Makes it easier to provide researchers and engineers with flexible blueprints for their implementations
  • Focus on solving problems, not setting up infrastructure
  • What are your technical needs? Do we already have something similar?
  • Leverage the AWS environment to provide easy access to data, services, tools, and resources
• Pleased with performance
  • We can “brute force” solutions if we have to
  • Most performance tuning is trivial
• Find the most cost-effective use cases for your needs
  • We have been able to strike a balance between serverless and managed services
  • Periodically do spot checks on cost; upfront calculations may have been incorrect
36. In Summary (cont’d)
• Go-to tech stack
  • Apache NiFi, Amazon S3, AWS Lambda, Amazon SQS, Amazon SNS, Amazon DynamoDB, Amazon RDS, and others as needed
• Take advantage of built-in events/triggers when you can
  • Most of the time, S3 + events are good enough
  • “Free” capability
• We have abandoned Kafka in favor of Apache NiFi site-to-site or Amazon SQS
  • Apache Kafka is great; we just don't have the administrative resources to support it. Use an AWS alternative when possible.
  • Most “streaming” requests from our customers don't really require streaming
• Ask researchers and engineers to catalog their data and follow basic data lake practices
  • Keep raw and enriched/augmented data separate
  • Add metadata to known events and important time frames
  • Enable start/stop and replay to improve evaluation
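The raw-versus-enriched separation recommended above is usually enforced with an S3 key-prefix convention. A minimal sketch, where the zone/dataset/date layout is a hypothetical convention rather than anything prescribed in the talk:

```java
public class DataLakeKeys {
    // Key layout convention (illustrative): <zone>/<dataset>/<date>/<file>
    // Keeping raw and enriched data under distinct top-level prefixes makes
    // replay easy: re-run enrichment over raw/ without touching enriched/.

    public static String rawKey(String dataset, String date, String file) {
        return "raw/" + dataset + "/" + date + "/" + file;
    }

    public static String enrichedKey(String dataset, String date, String file) {
        return "enriched/" + dataset + "/" + date + "/" + file;
    }

    public static void main(String[] args) {
        System.out.println(rawKey("tweets", "2017-11-28", "batch-001.json"));
        System.out.println(enrichedKey("tweets", "2017-11-28", "batch-001.json"));
    }
}
```

Date-based prefixes also line up naturally with the partition columns Athena and Glue crawlers expect.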