This document discusses key concepts for modern software design in big data systems. It covers topics like data structures, algorithms, distributed systems, and performance optimization. Specifically, it discusses techniques like caching, compression, locality, immutability, and consistency models. It provides examples from systems like MapReduce, Hadoop, Spark, Cassandra and Google. The goal is to understand principles for designing scalable, fault-tolerant and high performance big data systems.
4. Numbers Everyone Should Know
(taken from Jeff Dean – Google keynote)
• L1 cache reference: 0.5 ns
• Branch mispredict: 5 ns
• L2 cache reference: 7 ns
• Mutex lock/unlock: 25 ns
• Main memory reference: 100 ns
• Compress 1K w/cheap compression algorithm: 3,000 ns
• Send 2K bytes over 1 Gbps network: 20,000 ns
• Read 1 MB sequentially from memory: 250,000 ns
• Round trip within same datacenter: 500,000 ns
• Disk seek: 10,000,000 ns
• Read 1 MB sequentially from disk: 20,000,000 ns
• Send packet CA->Netherlands->CA: 150,000,000 ns
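These figures are most useful for back-of-envelope estimates. The sketch below is one hypothetical way to use them: the scenario (scanning 1 GB, doing 1,000 random reads) and all function names are illustrative assumptions, and only the latency constants come from the list above.

```python
# Latency figures from the list above, in nanoseconds.
LATENCY_NS = {
    "read_1mb_memory": 250_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 20_000_000,
}


def to_seconds(nanoseconds: float) -> float:
    return nanoseconds / 1e9


def sequential_disk_read(megabytes: int) -> float:
    """One seek followed by a sequential scan of `megabytes` MB."""
    total_ns = LATENCY_NS["disk_seek"] + megabytes * LATENCY_NS["read_1mb_disk"]
    return to_seconds(total_ns)


def sequential_memory_read(megabytes: int) -> float:
    """Sequential scan of `megabytes` MB that is already in memory."""
    return to_seconds(megabytes * LATENCY_NS["read_1mb_memory"])


def random_disk_reads(count: int) -> float:
    """Each random read pays a full seek; transfer time is ignored."""
    return to_seconds(count * LATENCY_NS["disk_seek"])


print("1 GB sequential from disk  :", sequential_disk_read(1024), "s")
print("1 GB sequential from memory:", sequential_memory_read(1024), "s")
print("1,000 random disk reads    :", random_disk_reads(1000), "s")
```

Under these numbers, scanning 1 GB takes roughly 20.5 s from disk and about 0.26 s from memory, while 1,000 random disk reads already cost 10 s in seeks alone.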
5.
6. Some facts
• Access latency: L1 << L2 << RAM << disk
• Sequential access is much faster than random access (10x or more)
• Cheap compression is faster than transferring the data over the network
• Transfer time: 1 Gbps network < disk < 100 Mbps network
Zippy (open-sourced as Snappy): encode @ 300 MB/s, decode @ 600 MB/s, 2-4x compression
gzip: encode @ 25 MB/s, decode @ 200 MB/s, 4-6x compression
https://code.google.com/p/snappy/
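The trade-off between compressing and sending raw bytes can be estimated with simple arithmetic. The sketch below assumes the Zippy throughput figures quoted above, a 2x compression ratio (the low end of the quoted range), and stages that run one after another with no overlap; the function names are hypothetical.

```python
def raw_transfer_seconds(size_mb: float, link_mb_per_s: float) -> float:
    """Time to send the data uncompressed."""
    return size_mb / link_mb_per_s


def compressed_transfer_seconds(
    size_mb: float,
    link_mb_per_s: float,
    encode_mb_per_s: float = 300.0,  # Zippy encode speed quoted above
    decode_mb_per_s: float = 600.0,  # Zippy decode speed quoted above
    ratio: float = 2.0,              # assumed: low end of the 2-4x range
) -> float:
    """Encode, send the smaller payload, then decode (no pipelining)."""
    encode = size_mb / encode_mb_per_s
    send = (size_mb / ratio) / link_mb_per_s
    decode = size_mb / decode_mb_per_s
    return encode + send + decode


LINK_100_MBPS = 12.5  # MB/s
LINK_1_GBPS = 125.0   # MB/s

for name, link in (("100 Mbps", LINK_100_MBPS), ("1 Gbps", LINK_1_GBPS)):
    raw = raw_transfer_seconds(100, link)
    packed = compressed_transfer_seconds(100, link)
    print(f"{name}: raw={raw:.2f}s compressed={packed:.2f}s")
```

With these assumptions the compressed path wins clearly on a 100 Mbps link (4.5 s versus 8 s for 100 MB). On a 1 Gbps link the outcome depends on the compression ratio and on whether encoding overlaps with sending.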
7. Key to Performance – Improve memory efficiency
Java is bad at memory efficiency:
int (4 bytes) -> Integer (16 bytes): always prefer primitive types, even though a standard Map key must be an Object
1M records, each with 5 string fields: 82 MB as raw data
a. Use Map<String, Map<String, String>>: 706 MB
b. Use Map<String, String[]>: 495 MB
c. Use Map<String, byte[][]>: 292 MB
d. Use ByteBuffer + Trove map: 92 MB
http://java-performance.info/overview-of-memory-saving-techniques-java/
8. Bloom Filter – Hash without value
Question: How to support remove?
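One common answer to the question above is a counting Bloom filter, which replaces each bit with a small counter so that an insertion can be undone. The sketch below is a minimal illustration, not the variant the talk necessarily had in mind; the class name, table size, and the use of SHA-256 to derive positions are all assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with counters instead of bits, so items can be removed."""

    def __init__(self, size: int = 1024, num_hashes: int = 3) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item: str):
        # Derive num_hashes positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> bool:
        # Only safe for items that were really added: removing a false
        # positive would corrupt the counters of other items.
        if not self.might_contain(item):
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True


bloom = CountingBloomFilter()
bloom.add("user:42")
print(bloom.might_contain("user:42"))
bloom.remove("user:42")
print(bloom.might_contain("user:42"))
```

The price is memory: each slot now holds a counter rather than a single bit, and the filter still stores no values.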
10. Data locality – Key to Performance
• At the cache level, the CPU always fetches data in whole cache lines (typically 64 bytes at a time)
Place variables used by the same thread close together
Place variables used by different threads at least 64 bytes apart to avoid false sharing (Java 8 introduced @Contended)
http://daniel.mitterdorfer.name/articles/2014/false-sharing/
11. Data locality – Key to Performance
• At the memory and disk level, reusing the same data set is faster because the cache is warm
• At the disk level, sequential access is about 10 times faster than random access => write data sequentially in blocks
Example: commit log, Bigtable row ranges
12. Data locality – Key to Performance
• At the network level, data locality means computing on data where it lives: instead of moving data to the computation, move the computation to the data. (The CPU is faster than the network, so shipping code is cheaper than shipping data.)
13. Data Decoupling – key to Scalability
Model data from the reader's/writer's perspective to eliminate hotspots, instead of grouping data conceptually
Example:
• Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. (agent group, access group vs. agent skills)
• Column (family) based databases
Anti-pattern: User settings in CfgPerson
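The GFS example can be made concrete with a toy flat namespace. This is only a sketch of the idea of mapping full pathnames to metadata; it is not how GFS is implemented, and the class name, method names, and metadata shape are hypothetical.

```python
from bisect import bisect_left, insort


class FlatNamespace:
    """Maps full pathnames to metadata; there is no per-directory structure."""

    def __init__(self) -> None:
        self._paths: list[str] = []        # full pathnames, kept sorted
        self._meta: dict[str, dict] = {}   # full pathname -> metadata

    def create(self, path: str, metadata: dict) -> None:
        if path not in self._meta:
            insort(self._paths, path)
        self._meta[path] = metadata

    def lookup(self, path: str):
        # A single table lookup, with no walk through directory objects.
        return self._meta.get(path)

    def list_prefix(self, prefix: str) -> list[str]:
        # "Listing a directory" becomes a range scan over the sorted keys.
        start = bisect_left(self._paths, prefix)
        result = []
        for path in self._paths[start:]:
            if not path.startswith(prefix):
                break
            result.append(path)
        return result


ns = FlatNamespace()
ns.create("/home/a/x", {"size": 1})
ns.create("/home/b/z", {"size": 3})
ns.create("/home/a/y", {"size": 2})
print(ns.list_prefix("/home/a/"))
```

Because no directory object has to be updated when a file is created, two writers in the same directory do not contend on a shared structure, which is the hotspot the slide warns about.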
14. Data Decoupling – key to Scalability
Normalization or denormalization? That is the question.
For decades we have been taught that normalization is good: small size + consistency
But it creates strong data coupling => hard to scale
15. Data Immutability – key to Scalability
• Always available, no contention
• Always consistent, no need to synchronize
• Can be replicated freely whenever needed
16. Data Immutability – key to Scalability
• Append instead of update (GFS)
• Merge instead of update (SSTable)
• Add tombstone instead of delete (Cassandra)
17. SSTable
• SSTable: immutable sorted string table; the index table is always in memory
• Merge to remove tombstones
18. SSTable (LSM-Tree)
• Commit log (per node): sequential writes to maximize write throughput (vs. B+ tree)
• SSTable (per column family): immutable sorted string table; the index table is always in memory
• Merge to remove tombstones
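The merge step can be sketched as follows. This is a simplified illustration under stated assumptions: tables are in-memory lists of sorted (key, value) pairs ordered oldest first, the TOMBSTONE marker and function name are hypothetical, and a dictionary stands in for the streaming k-way merge a real storage engine would use.

```python
# Hypothetical marker written in place of a value when a key is deleted.
TOMBSTONE = object()


def merge_sstables(tables):
    """Merge immutable sorted tables into one new sorted table.

    `tables` is ordered oldest first, so later tables override earlier ones.
    Keys whose newest entry is a tombstone are dropped from the result.
    """
    latest = {}
    for table in tables:              # oldest -> newest
        for key, value in table:
            latest[key] = value       # a newer entry replaces an older one
    return [(key, latest[key]) for key in sorted(latest)
            if latest[key] is not TOMBSTONE]


older = [("a", "1"), ("b", "2"), ("c", "3")]
newer = [("b", TOMBSTONE), ("c", "30"), ("d", "4")]
print(merge_sstables([older, newer]))
```

Neither input table is modified: the update to "c" and the deletion of "b" only take effect in the newly written table, which is what lets readers keep using the old files without locking.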
19. Shared nothing architecture
• Nodes are independent and self-sufficient
• No single point of contention across the system
• The invention of the DHT (distributed hash table)
21. Consistent Hash – two objects (data and host) meet in one keyspace
Karger et al. (MIT, 1997); Chord (MIT, 2001)
Cassandra, MapReduce
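A minimal sketch of a consistent hash ring follows. It is an illustration rather than the scheme any particular system uses: the class name, the use of SHA-256, and the number of virtual points per node are assumptions.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Hash any string (node label or data key) into the same keyspace."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes=(), points_per_node: int = 100) -> None:
        self.points_per_node = points_per_node
        self._positions: list[int] = []     # sorted ring positions
        self._owner: dict[int, str] = {}    # ring position -> node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Several virtual points per node smooth out the key distribution.
        for i in range(self.points_per_node):
            pos = ring_position(f"{node}#{i}")
            if pos not in self._owner:
                bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: self._owner[p] for p in self._positions}

    def get_node(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring is empty")
        # First node point clockwise from the key, wrapping around the ring.
        idx = bisect.bisect(self._positions, ring_position(key)) % len(self._positions)
        return self._owner[self._positions[idx]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user:42"))
```

When a node leaves, only the keys it owned move to other nodes; every other key keeps its owner, which is the property that makes the scheme attractive for shared-nothing clusters.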
22. HRW (highest random weight) hashing
An alternative solution: hash each host together with the data, then pick the best fit
w1 = h(S1, O), w2 = h(S2, O), ..., wn = h(Sn, O)
Winner: wO = max {w1, w2, ..., wn}
David Thaler and Chinya Ravishankar (University of Michigan, 1996)
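The weight formula above translates almost directly into code. In this sketch the choice of SHA-256 for h and the function names are assumptions; any well-mixed hash of the (server, object) pair would serve.

```python
import hashlib


def weight(server: str, obj: str) -> int:
    """w_i = h(S_i, O): hash the server and the object together."""
    digest = hashlib.sha256(f"{server}|{obj}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_server(servers, obj: str) -> str:
    """The winner is the server with the highest weight for this object."""
    return max(servers, key=lambda server: weight(server, obj))


servers = ["s1", "s2", "s3", "s4"]
print(pick_server(servers, "object-17"))
```

No ring or lookup table is needed: every client that knows the server list computes the same winner, and removing a server only remaps the objects that server had won.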
26. Apache Spark
• Developed by Berkeley AMPLab
• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Resilient Distributed Datasets (RDDs)
27. Resilient Distributed Datasets (RDDs)
• Hadoop MapReduce works on disk -> slow
RDDs are a distributed memory model -> fast
• Traditional distributed memory supports fine-grained updates -> no fault tolerance unless it relies on extensive logging or replication
RDDs are immutable and created by coarse-grained transformations (map, join, filter) -> a lost partition can be quickly rebuilt
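The idea of rebuilding lost data from its lineage can be shown with a toy, single-process model. This is not the Spark API: the Dataset class, its methods, and lose_cached_data are hypothetical names used only to illustrate immutable datasets derived by coarse-grained transformations.

```python
class Dataset:
    """Immutable dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None) -> None:
        self._source = tuple(source) if source is not None else None
        self._parent = parent          # dataset this one was derived from
        self._transform = transform    # coarse-grained function over all rows
        self._cache = None             # in-memory copy; may be lost

    def map(self, fn) -> "Dataset":
        return Dataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, predicate) -> "Dataset":
        return Dataset(parent=self,
                       transform=lambda rows: [r for r in rows if predicate(r)])

    def collect(self) -> list:
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                # Recompute from the parent instead of restoring a replica.
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_cached_data(self) -> None:
        """Simulate a node failure that drops the in-memory copy."""
        self._cache = None


numbers = Dataset(source=range(10))
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())
squares_of_evens.lose_cached_data()
print(squares_of_evens.collect())
```

Because each dataset records only the transformation that produced it, recovery needs neither a log of individual updates nor a replica of the data.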
28. Other interesting algorithms
• HyperLogLog (Cassandra)
• Skip list (Lucene, Redis, LevelDB)
• MurmurHash (Google, Cassandra)
• Ball tree (Google Maps)
• Fractal tree (MySQL, MongoDB)
• Dynamic time warping
29. Check list
• Calculate performance in your design
• Estimate data size before you build it
• Good designs are always tailored
• Know your tools (Guava, GS Collections, protobuf, Snappy…)
• Share with others