Slides for my presentation at SoCal Code Camp, June 29, 2014 (http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=68942cd0-6714-4753-a218-20d4b48da07d)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Kai Chan
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=cc1e6803-b0ec-4832-b8df-e15ea7bd7694
Search Engine-Building with Lucene and SolrKai Chan
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
Data transformation has traditionally required expertise in specialized data platforms and typically been restricted to the domain of IT. A domain specific language (DSL) separates the user’s intent from a specific implementation, while maintaining expressivity. A user interface can be used to produce these expressions, in the form of suggestions, without requiring the user to manually write code. This higher level interaction, aided by transformation previews and suggestion ranking allows domain experts such as data scientists and business analysts to wrangle data while leveraging the optimal processing framework for the data at hand.
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=8cdfd955-2cd4-44a2-ad08-5353e079685a
Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)Kai Chan
These are the slides for the session I presented at SoCal Code Camp Los Angeles on November 10, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=cc1e6803-b0ec-4832-b8df-e15ea7bd7694
Search Engine-Building with Lucene and SolrKai Chan
These are the slides for the session I presented at SoCal Code Camp San Diego on July 27, 2013.
http://www.socalcodecamp.com/socalcodecamp/session.aspx?sid=6b28337d-6eae-4003-a664-5ed719f43533
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Data Con LA
Data transformation has traditionally required expertise in specialized data platforms and typically been restricted to the domain of IT. A domain specific language (DSL) separates the user’s intent from a specific implementation, while maintaining expressivity. A user interface can be used to produce these expressions, in the form of suggestions, without requiring the user to manually write code. This higher level interaction, aided by transformation previews and suggestion ranking allows domain experts such as data scientists and business analysts to wrangle data while leveraging the optimal processing framework for the data at hand.
At the Dublin Fashion Insights Centre, we are exploring methods of categorising the web into a set of known fashion related topics. This raises questions such as: How many fashion related topics are there? How closely are they related to each other, or to other non-fashion topics? Furthermore, what topic hierarchies exist in this landscape? Using Clojure and MLlib to harness the data available from crowd-sourced websites such as DMOZ (a categorisation of millions of websites) and Common Crawl (a monthly crawl of billions of websites), we are answering these questions to understand fashion in a quantitative manner.
The latest generation of big data tools such as Apache Spark routinely handle petabytes of data while also addressing real-world realities like node and network failures. Spark's transformations and operations on data sets are a natural fit with Clojure's everyday use of transformations and reductions. Spark MLlib's excellent implementations of distributed machine learning algorithms puts the power of large-scale analytics in the hands of Clojure developers. At Zalando's Dublin Fashion Insights Centre, we're using the Clojure bindings to Spark and MLlib to answer fashion-related questions that until recently have been nearly impossible to answer quantitatively.
Hunter Kelly @retnuh
tech.zalando.com
In this presentation, Amit explains querying with MongoDB in detail including Querying on Embedded Documents, Geospatial indexing and Querying etc.
The tutorial includes a recap of MongoDB, the wrapped queries, queries which are using modifiers, Upsert (saving/ updating queries), updating multiple documents at once, etc. Moreover, it gives a brief explanation about specifying which keys to return, the AND/OR queries, querying on embedded documents, cursors and Geospatial indexing. The tutorial begins with a section about MongoDB which includes steps to install and start MongoDB, to show and select Database, to drop collection and database, steps to insert a document and get up to 20 matching documents. Furthermore, it also includes steps to store and use Javascript functions on the server side.
The next section after the MongoDB section is about wrapped queries and queries using modifiers which includes the types of wrapped queries which are used like LikeQuery, SortQuery, LimitQuery, SkipQuery. It also includes the types of queries using modifiers like NotEqualModifier, Greater/Lesser modifier, Increment Modifier, Set Modifier, Unset Modifier, Push Modifier etc. Then comes the section about Upsert (Save or update). There are steps mentioned for saving or updating queries in this section.
At the same time, there are steps to update multiple documents altogether. The next section which is called “specifying which keys to return” talks about ways to specify the keys the user wants. After this section comes OR/AND query. It informs us about the general steps to do an OR query. Also, it includes the general steps to do an AND query. After this section comes another section called “querying on embedded document” which tells the user about ways of querying for an embedded document.
One of the important sections of this tutorial is about cursors, uses of a cursor and also methods to chain additional options onto a query before it is performed. Following is a section about indexing which talks about indexing as a term and how indexing helps in improving the query’s speed. At the end is a section which gives a brief explanation on geospatial indexing which is another type of query that became common with the emergence of mobile devices. Also, it includes the ways geospatial queries can be performed.
Writing a TSDB from scratch_ performance optimizations.pdfRomanKhavronenko
I'm one of maintainers of the open source TimeSeries database VictoriaMetrics, written in Go. It is used for APM or Kubernetes monitoring. The average VictoriaMetrics installation is processing 2-4 million samples/s on the ingestion path, 20-40 million samples/s on the read path. The biggest installations have more than 100 million samples/s on the ingestion path for a single cluster. This requires being clever with data processing to keep it efficient and scalable. In the talk, I'll cover the following optimizations for keeping the database fast:
1. String interning for lowering GC pressure. We use string interning for storing time series metadata (aka labels). However, this approach has downside of increased memory usage. When it is worth it to use string interning?
2. Metadata processing may require many regular expression matching and strings modification operations. Caching results of such operations helps to save CPU. But the downside could be increased memory usage. Which operations should be cached and which are not?
3. Limiting the number of concurrently running goroutines with CPU-bound load by the number of available CPU cores. This helps to control the memory usage on load spikes (which is a frequent event in monitoring). The limit also improves the processing speed of each goroutine, since it reduces the number of context switches. The downside of the approach is its complexity - it is easy to make a mistake and end up with a deadlock or inefficient resource utilization.
4. The better understanding of `sync.Pool`. For us, `sync.Pool` shows itself the best when used in CPU-bound code, while in IO-bound code it leads to excessive memory usage. The CPU-bound code has short ownership over the objects retrieved from the pool. In combination with p.3 (limited number of goroutines processing CPU-bound code) it gives the most efficient processing speed and memory usage since the chance to get a "hot" object from the pool is much higher.
At the Dublin Fashion Insights Centre, we are exploring methods of categorising the web into a set of known fashion related topics. This raises questions such as: How many fashion related topics are there? How closely are they related to each other, or to other non-fashion topics? Furthermore, what topic hierarchies exist in this landscape? Using Clojure and MLlib to harness the data available from crowd-sourced websites such as DMOZ (a categorisation of millions of websites) and Common Crawl (a monthly crawl of billions of websites), we are answering these questions to understand fashion in a quantitative manner.
The latest generation of big data tools such as Apache Spark routinely handle petabytes of data while also addressing real-world realities like node and network failures. Spark's transformations and operations on data sets are a natural fit with Clojure's everyday use of transformations and reductions. Spark MLlib's excellent implementations of distributed machine learning algorithms puts the power of large-scale analytics in the hands of Clojure developers. At Zalando's Dublin Fashion Insights Centre, we're using the Clojure bindings to Spark and MLlib to answer fashion-related questions that until recently have been nearly impossible to answer quantitatively.
Hunter Kelly @retnuh
tech.zalando.com
In this presentation, Amit explains querying with MongoDB in detail including Querying on Embedded Documents, Geospatial indexing and Querying etc.
The tutorial includes a recap of MongoDB, the wrapped queries, queries which are using modifiers, Upsert (saving/ updating queries), updating multiple documents at once, etc. Moreover, it gives a brief explanation about specifying which keys to return, the AND/OR queries, querying on embedded documents, cursors and Geospatial indexing. The tutorial begins with a section about MongoDB which includes steps to install and start MongoDB, to show and select Database, to drop collection and database, steps to insert a document and get up to 20 matching documents. Furthermore, it also includes steps to store and use Javascript functions on the server side.
The next section after the MongoDB section is about wrapped queries and queries using modifiers which includes the types of wrapped queries which are used like LikeQuery, SortQuery, LimitQuery, SkipQuery. It also includes the types of queries using modifiers like NotEqualModifier, Greater/Lesser modifier, Increment Modifier, Set Modifier, Unset Modifier, Push Modifier etc. Then comes the section about Upsert (Save or update). There are steps mentioned for saving or updating queries in this section.
At the same time, there are steps to update multiple documents altogether. The next section which is called “specifying which keys to return” talks about ways to specify the keys the user wants. After this section comes OR/AND query. It informs us about the general steps to do an OR query. Also, it includes the general steps to do an AND query. After this section comes another section called “querying on embedded document” which tells the user about ways of querying for an embedded document.
One of the important sections of this tutorial is about cursors, uses of a cursor and also methods to chain additional options onto a query before it is performed. Following is a section about indexing which talks about indexing as a term and how indexing helps in improving the query’s speed. At the end is a section which gives a brief explanation on geospatial indexing which is another type of query that became common with the emergence of mobile devices. Also, it includes the ways geospatial queries can be performed.
Writing a TSDB from scratch_ performance optimizations.pdfRomanKhavronenko
I'm one of maintainers of the open source TimeSeries database VictoriaMetrics, written in Go. It is used for APM or Kubernetes monitoring. The average VictoriaMetrics installation is processing 2-4 million samples/s on the ingestion path, 20-40 million samples/s on the read path. The biggest installations have more than 100 million samples/s on the ingestion path for a single cluster. This requires being clever with data processing to keep it efficient and scalable. In the talk, I'll cover the following optimizations for keeping the database fast:
1. String interning for lowering GC pressure. We use string interning for storing time series metadata (aka labels). However, this approach has downside of increased memory usage. When it is worth it to use string interning?
2. Metadata processing may require many regular expression matching and strings modification operations. Caching results of such operations helps to save CPU. But the downside could be increased memory usage. Which operations should be cached and which are not?
3. Limiting the number of concurrently running goroutines with CPU-bound load by the number of available CPU cores. This helps to control the memory usage on load spikes (which is a frequent event in monitoring). The limit also improves the processing speed of each goroutine, since it reduces the number of context switches. The downside of the approach is its complexity - it is easy to make a mistake and end up with a deadlock or inefficient resource utilization.
4. The better understanding of `sync.Pool`. For us, `sync.Pool` shows itself the best when used in CPU-bound code, while in IO-bound code it leads to excessive memory usage. The CPU-bound code has short ownership over the objects retrieved from the pool. In combination with p.3 (limited number of goroutines processing CPU-bound code) it gives the most efficient processing speed and memory usage since the chance to get a "hot" object from the pool is much higher.
go-git is a 100% Go libray used to interact with git repositories. Even if it already supports most of the functionality it still lags a bit in performance when compared with the git CLI or some other libraries. I'll explain some of the problems that we face when dealing with git repos and some examples of performance improvements done to the library.
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data and our journey to migrate our entire data infrastructure to Google Cloud and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
Dart is a new language for the web, enabling you to write JavaScript on a secure and manageable way. No need to worry about "JavaScript: The bad parts".
This presentation concentrates on the developer experience converting from the Java based GWT to Dart.
Apidays Paris 2023 - Forget TypeScript, Choose Rust to build Robust, Fast and...apidays
Apidays Paris 2023 - Software and APIs for Smart, Sustainable and Sovereign Societies
December 6, 7 & 8, 2023
Forget TypeScript, Choose Rust to build Robust, Fast and Cheap APIs
Zacaria Chtatar, Backend Software Engineer at HaveSomeCode
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Gruter
Apache Tajo is an open source big data warehouse system on Hadoop. This slide shows two high-tech efforts for performance improvement in Tajo project. First one is query optimization including cost-based join order and progressive optimization. The second effort is JIT-based vectorized processing.
Java heap memory model has wasteful memory usage. References, object headers, internal collection structure, extra fields such as String.hashCode… This talk shows practical ways to reduce memory usage and fit more data into memory: primitive types, specialized java collections, bit packing, reducing number of pointers, replacing String with char[], semi-serialized objects… As bonus we get lower GC overhead by reducing number of references.
Talking about Neo4j after 1 year of using it production. This presentation covering db structure(internals), cypher queries, extensions development, db tuning & settings.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
5. Ways to Concatenate Strings
● operators (+ and +=)
● String’s concat method
● StringBuilder
6. String s = a + b + c + d
same bytecode as:
String s = new
StringBuilder().append(a).append(b)
.append(c).append(d).toString()
7. String s = s + a + b + c + d
String s += a + b + c + d
same bytecode as:
String s = new
StringBuilder().append(s).append(a)
.append(b).append(c).append(d).toSt
ring()
9. StringBuilder result = new
StringBuilder();
while (stillRunning()) {
…
result.append(a);
…
}
10. String result = “”;
while (stillRunning()) {
…
result += a;
…
}
11. String result = “”;
while (stillRunning()) {
…
result = new StringBuilder()
.append(s).append(a).toString();
…
}
12. String result = “”;
while (stillRunning()) {
…
result = new StringBuilder()
.append(s).append(a).toString();
…
}
copy characters from
String to StringBuilder
copy characters from
StringBuilder to String
create a new object
13. String Concatenation Example
● 100000 iterations
● a 1 byte String appended per iteration
● 100 KB of data appended
// initialize result variable
…
for (int i = 1; i <= 100000; i++) {
// append 1-byte string to result
}
14. + or += operator
● iteration i:
o copy (i - 1) bytes from String to StringBuilder
o append 1 byte to StringBuilder
o copy i bytes from StringBuilder to String
● total: copy 1.4 GB of data
15. String’s concat method
● iteration i:
o copy (i - 1) bytes from the old String, plus 1 byte to
append with, to a new String
● total: copy 672 MB of data
16. StringBuilder
● when needed to expand
o create a new buffer at least 2x as large
o copy data over
● to grow the buffer from 1 byte to >= 100000
bytes: copy 128 KB of data
● plus, copy 98 KB of data from outside
StringBuilder
● total: copy 226 KB of data
19. public String getText() {
StringBuilder result = new
StringBuilder();
…
return result.toString();
}
copies all characters in the StringBuilder
to a String
22. ● no direct support
o cannot have List<int>, Map<int, long>
● indirect support via boxing
o “wrapper” classes, e.g. Integer, Long
o can have List<Integer>, Map<Integer, Long>
● memory cost
Primitives Collection in Java
23. Object Metadata
● object’s class
● lock/synchronization state
● array size (if the object is an array)
● total: ~8-12 bytes for 32-bit JVM
(more for 64-bit JVM)
25. ArrayList<Integer> w/ 1000 Elements
metadat
a
size
8 bytes 4 bytes 4 or 8 bytes
metadata
value
metadata
value
metadata
value
8 bytes 8 bytes 8 bytes
4 bytes 4 bytes 4 bytes
size = 16028 bytes (if OOP size = 32 bits)
size = 20032 bytes (if OOP size = 64 bits)
metadat
a
element 999element 0
12 bytes 4 or 8 bytes 4 or 8 bytes 4 or 8 bytes
…element 1
elementData
26. Map: Even More Overhead
● both key and value can be boxed primitives
● some types of map half-full by design
● e.g. HashMap<Integer, Integer>
27. Primitive Collection Implementations
● Trove
o http://trove.starlight-systems.com/
o does NOT implement Java’s Collection interface
o iteration by callback object/method
28. Primitive Collection Implementations
● fastutil
o http://fastutil.di.unimi.it/
o type-specific maps, sets, lists
o sorted maps and sets
o implement Java’s Collection interface
o primitive iterators
o latest version’s binary JAR file is 17 MB
29. Convert Primitive Array To List
● you have an int[]
● some method only takes a List
(e.g. List<Integer>)
● Arrays.asList(int[]): type mismatch
30. // someFunc takes a List, read-only
// a: int[]
list = new
ArrayList<Integer>(a.length);
for(int i = 0; i < a.length; i++){
list.add(a[i]);
}
someFunc(list);
31. // someFunc takes a List, read-only
// a: int[]
someFunc(Ints.asList(a));
● Ints.asList: returns a fixed-size list backed by
the specified array
● Ints provided by Guava (from Google)
35. while (stillRunning()) {
…
// str and regex: String
if (Pattern.compile(regex)
.matcher(input)
.matches()) {
…
}
…
}
36. while (stillRunning()) {
…
// str and regex: String
if (Pattern.compile(regex)
.matcher(input)
.matches()) {
…
}
…
}
generate a Pattern object from the RE pattern
(expensive)
38. RE Benchmark
● Java Regular expression library benchmarks
o http://tusker.org/regex/regex_benchmark.html
● many alternative implementations
● speed and capability vary
41. Caching: Example Scenario
● calculate file checksums
● calculation is expensive
● lots of files
● repeated calculations
● solution: cache the checksum
o key: file path
42. Possible Solution: HashMap
● implement cache with
HashMap<String, byte[]>
● problem:
o assume cache will be “small enough”, or
o need to remove entries (how?)
43. Possible Solution: WeakHashMap
● implement cache with
WeakHashMap<String, byte[]>
● problem:
o keys are weak references
o if no strong reference on a key,
the entry is eligible for GC
44. Solution: Libraries
● Apache Commons Collections
o LRUMap class
o LRU (least recently updated) cache algorithm
o fixed max size
45. Solution: Libraries (cont.)
● Guava
o CacheBuilder and Cache classes
o accessible by multiple threads
o eviction by: max size, access time, write time, weak
keys, weak values, soft values
o CacheLoader class: creates entry on demand
o remove listeners
47. Lossless Compression Today
● compression saves more space than ever
● store more in faster medium (memory, SSD)
● compression ! == slow anymore
● new algorithms target high throughput
o > 300 MB/s compression
o > 1000 MB/s decompression
48. LZ4
● faster than Deflate with zlib
● decent ratio
● LZ4_HC
o slower compression speed
o higher compression ratio
o decoding side unchanged
● ported to Java
o https://github.com/jpountz/lz4-java
50. Memory-Mapped File Intro
● MappedByteBuffer
● content: memory-mapped region of a file
● OS handles reading, writing, caching
● can be used as persistent, shared memory
51. Memory-Mapped File Caveats
● file size limited by address space
o 64-bit JVM: not an issue
o 32-bit JVM: file size limited to <= 2 GB
● no “close function”
o you can’t control when resources are free
o not suitable for opening many files at/around the
same time
53. VisualVM
● bundled with Java SE
● local or remote (JMX) applications
● monitor
o CPU usage
o heap and PermGen usage
o class and thread counts
54. VisualVM
● heap dump
o instance and size by class
o biggest object by retain size
● sampler (fast)
● profiler (slow)
55. Sampling Example
● run 4 methods
o concatByAppendresult.append(x());
o concatByConcatresult =
result.concat(x());
o concatByPlusOperatorresult = result + x();
o concatByPlusEqualOperatorresult += x();
● if same speed, each would get 25% CPU
56.
57. Thanks for Coming!
● slides available
o http://bit.ly/sdcodecamp2014java
● please vote for my conference session
o http://bit.ly/tvnews2014
● questions/feedback
o kai@ssc.ucla.edu
● questions?
Editor's Notes
tips covered here:
* most are learned and tested in production
* some are not (but interesting enough)
performance goals:
* finish a task in less time, given same amount of parallelization
* use less memory
language war:
* there are places for that
* use the right tool for the right job
* programmers can make things work better
exhaustive list or really in-depth analysis:
* there are books for that
* we only have 1 hour
* get you interesting in this topic, if you are not already
some truth for the saying that the compiler uses StringBuilder behind the scene
some truth for the saying that the compiler uses StringBuilder behind the scene
* when the concatenation is done inside a loop, things become more complex
* here’s how the code looks like if you use StringBuilder explicitly
* what if you use the + or += operator and hope the compiler optimizes things for you?
* turns out, the compiler rewrites the code into this
* note these 3 operations within the loop
* if you think all these object creation and copying slow things down, you’re probably right
* using StringBuilder explicitly results in much less copying of data
* in short, if you’re concatenating strings inside a loop, use StringBuilder explicitly
* CharSequence: a less well-known interface
* however, the Writer support writing a CharSequence’s content to a file, etc.
* CharSequenceReader (in Apache Commons IO library) to read from CharSequence
* whenever we have a method that returns a String, and we have to convert something into a String return value, ask: do we really have to do the conversion?
* might be able to modify to method to return another type, like CharSequence
* might be able to avoid the conversion
Zip and GZip use same algorithm
Deflate
zlib: open-source implementation
compression saves more space than ever
compression ! == slow anymore
CPU much faster than I/O
compressed data: less I/O
* MappedByteBuffer: get from RandomAccessFile and FileChannel
http://www.onjava.com/pub/a/onjava/2002/10/02/javanio.html?page=3