The document describes several radical steps in computer architecture that were implemented by the author's team before the industry, including carry save arithmetic in 1954 and the Elbrus computer architecture in 1978. It discusses drawbacks of current superscalar architectures and outlines the principles of a proposed "best possible computer system" with support for high-level programming, fine-grained parallelism, dynamic data types, and full security through capability features. Key aspects are a new universal programming language, a compiler with full algorithm and hardware information, and hardware designed for optimization rather than artificial constraints.
2. Nearly all basic radical steps in architecture were made by our team before anyone in industry
• "Carry-save arithmetic" – one of the two basic technologies still in use for the main arithmetic primitive operations
– my student's work (1954), presented at a university conference (1955).
• Definition and implementation of the best possible architecture functionality in the Elbrus computer (1978), widely used in our country, including:
– High-level programming architecture support (not just support of existing HLLs corrupted by outdated architectures) – without parallel execution functionality (the HW of that time was not ready for it); not implemented so far in any existing computer
– A real HLL, EL-76 (1976), for the Elbrus computers
– A clean, best possible OS kernel (no privilege mode) supporting real high-level programming
• The Elbrus architecture, whose main goal is a real HLL (EL-76), with the Elbrus OS kernel as a byproduct, fully solved the security problem, including the possibility of supporting proofs of user program correctness.
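Carry-save arithmetic, mentioned above, can be illustrated with a short sketch. The core idea is a 3:2 compressor: three addends are reduced to a sum word and a carry word with no carry propagation between bit positions, so the slow carry-propagating addition happens only once at the end of a long chain of additions (as in a multiplier's partial-product reduction). This is a minimal illustration, not the original 1954 design:

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: reduce three addends to a (sum, carry) pair.

    Each bit position is computed independently - no carry ripples
    between positions, so the delay is one full-adder regardless of width.
    """
    partial_sum = a ^ b ^ c                  # per-bit sum, carries ignored
    carry = (a & b) | (a & c) | (b & c)      # per-bit carry-out
    return partial_sum, carry << 1           # carry feeds the next position


def multi_add(values) -> int:
    """Sum many numbers with chained compressors and a single
    carry-propagating addition at the very end."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c  # the only slow, carry-propagating add
```

The invariant is that `s + c` always equals the sum of the values processed so far; only the final `s + c` pays the carry-propagation delay.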
3. OUR RADICAL STEPS (first in industry) (cont.)
• The very first-in-technology implementation of an OOO superscalar (Elbrus 1, 1978) and, even more important, at an early stage (after the second generation of Elbrus computers in 1985), abandoning the superscalar approach, showing its weak points, and starting to look for a more robust solution to the parallel execution problem.
• Successful implementation of a cluster-based VLIW architecture with fine-grained parallel execution (Elbrus 3, end of the 90s), probably for the first time in the industry.
• Suggestion and the first implementation of Binary Translation (BT) technology for designing a new architecture built on radically new principles yet binary compatible with the old ones (Elbrus 3, end of the 90s).
• Design and simulation of radically new principles of fine-grained parallel architecture, and extension of the HLL (like EL-76) and the OS (like the Elbrus OS kernel) to support them.
5. Drawbacks of current superscalar (SS)
• Program conversion in SS is rather complicated:
parallel algorithm → sequential binary → implicitly parallel inside SS → sequential at retirement.
• SS has a performance limit (independent of the available HW).
• Inability to use all available HW properly.
• A funny situation exists with the SMT mechanism: SMT is used instead of the natural parallelism of the algorithm.
• Rather complicated VECTOR HW and MULTI-THREAD programming.
• The current architecture has corrupted all of today's HLLs.
• The current architecture does not support dynamic data typing and object-oriented data memory.
This excludes the possibility of supporting good security and debugging facilities.
• The current organization of computations does not allow good optimization:
the compiler has no full information about the algorithm or the HW (corrupted HLL);
the cache structure of today's architectures hides their internal structure, preventing the compiler from optimizing their operation well.
• Today's architecture is far from being universal.
• Etc.
An extremely important point here is that all the above-mentioned drawbacks (including the HLL and OS ones) have a single source: inheriting, as its basics, the principles of ancient, early-days computing with their strong HW size constraints.
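The conversion chain above can be made concrete with a toy model of how a superscalar must re-discover, at run time, parallelism that the sequential binary hides: an instruction may join the current issue group only if it has no RAW/WAW/WAR hazard against instructions already in the group. The three-address tuple format `(dest, src1, src2)` and the function name are hypothetical, chosen only for illustration:

```python
def issue_groups(instrs, width=4):
    """Split a sequential instruction stream into groups that a
    `width`-wide superscalar could issue together, using register
    dependence checks (RAW, WAW, WAR) - the work the HW repeats on
    every execution of the same code."""
    groups, current = [], []
    for dest, *srcs in instrs:
        writes = {d for d, *_ in current}
        reads = {s for _, *ss in current for s in ss}
        raw = any(s in writes for s in srcs)   # read-after-write
        waw = dest in writes                   # write-after-write
        war = dest in reads                    # write-after-read
        if current and (len(current) >= width or raw or waw or war):
            groups.append(current)
            current = []
        current.append((dest, *srcs))
    if current:
        groups.append(current)
    return groups


# Three independent adds can issue together; the dependent fourth cannot.
prog = [("r1", "a", "b"), ("r2", "c", "d"),
        ("r3", "e", "f"), ("r4", "r1", "r2")]
```

Note that this dependence analysis is repeated by the hardware on every pass through the code, which is exactly the redundancy the deck objects to.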
6. EARLY DAYS' COMPUTING
Main constraint – shortage of HW: a single execution unit (EU) and a small linear memory.
The execution unit was un-improvable: carry-save and high-radix arithmetic.
Therefore, the whole architecture was un-improvable and universal within the said constraints.
Basic architecture decisions:
• Single Instruction Pointer binary (SIP)
• Simple unstructured linear memory (LM)
• No data type support (No DT)
The binary was the sequence (SIP) of instructions for the main resource – the single EU.
The argument of an instruction was the address of another resource – a memory location (LM).
There was no data type support (No DT) – shortage of resources.
All execution optimization was the programmer's job; he knew the algorithm and the HW resources well. At that time both the algorithms to be executed and the HW were rather simple, so the programmer was able to do his job very well.
The input binary contained instructions for how to use resources, rather than a description of the algorithm.
The design was the best possible for those constraints.
7. SUPERSCALAR (SS)
With SS the situation became different:
• No HW size constraint.
• The main constraint is the requirement of user-level compatibility with old computers (SIP, LM, no dynamic data types).
• Program size, HW complexity and the optimization job became very big.
The many drawbacks of the superscalar presented above can be split into two areas:
• Bad functionality (semantics of data and operations).
Without supporting dynamic data types in HW it is impossible to correct this drawback; it is impossible to support real high-level programming and full security.
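The role dynamic data types in HW would play can be sketched in software. In a tagged architecture every word carries a type tag, and every operation checks tags before acting, so a raw integer can never be misused as a reference. The class and function names below are hypothetical; this is a minimal sketch of the idea, not of any real Elbrus encoding:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A machine word that carries its type tag along with its value."""
    tag: str       # e.g. "int", "float", "descriptor"
    value: object


def add(x: Word, y: Word) -> Word:
    """Integer add: traps (raises) if either operand's tag is wrong."""
    if x.tag != "int" or y.tag != "int":
        raise TypeError("tag check failed: add needs two int words")
    return Word("int", x.value + y.value)


def load(ref: Word, index: int):
    """Memory access: only a word tagged as a descriptor may be
    dereferenced, so forged pointers are rejected in 'hardware'."""
    if ref.tag != "descriptor":
        raise TypeError("tag check failed: load needs a descriptor word")
    return ref.value[index]
```

In an untyped-word architecture both calls would silently succeed on wrong operands; here the tag check turns the misuse into an immediate, debuggable trap.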
8. SUPERSCALAR (SS) (cont.)
• Bad performance.
In SS, optimization is carried out by the programmer, the language compiler, and the HW.
Programmer:
• Now it is too complicated for him, and he does not know the complicated HW.
• Due to the corrupted HLL he cannot specify the results of optimization correctly.
Compiler: optimization is the right job for it (for it only), but in SS there are no good conditions for that:
• Due to the corrupted HLL the compiler has no full information about the algorithm.
• The compiler is not local to the model – it does not have enough information about the model HW either, including the cache structure, which is hidden from the compiler for compatibility reasons.
HW (BPU, prefetching, eviction): it is the wrong job for it:
• HW has no algorithm information.
• The HW structure is not adjusted to the algorithm structure ("artificial binding").
9. BEST POSSIBLE COMPUTER SYSTEM
A radical step towards the Best Possible System (BPS) should move the design to the strongly opposite extreme – from caring about resources to caring about algorithms.
Two BPS systems will be discussed:
• UNCONSTRAINED BPS, with the only constraints being the algorithm itself and the resource size of the specific HW model.
• CONSTRAINED BPS, with the previous constraints plus user-level compatibility with x86 (or ARM, etc.).
All mechanisms designed for the unconstrained BPS are best possible and should be used as the basis of the constrained BPS; besides them, a few mechanisms should be added for compatibility support.
For this, the following requirements should be satisfied by the language, compiler and HW of the unconstrained BPS.
10. New language for BPS
The compiler should have full information about the algorithm. That means the algorithm should be presented in a new universal language that is not corrupted by old architectures.
The programmer's job is to optimize the algorithm only, not its execution; his responsibility is only to give full information about the algorithm to the compiler.
This language should have at least three important features:
• Support for the presentation of fine-grained parallel algorithms (parallelism)
• The right functionality (semantics) of its elements, including dynamic data typing and a capability feature
• The possibility to present exhaustive information about the algorithm
The second feature is completely implemented in the EL-76 language, used in several generations of computers in our country.
11. COMPILER for BPS
Only the compiler can and should do optimization in a BPS, but it needs the following good conditions for that:
• It should have full information about the algorithm.
The programmer should provide it using the new language.
• It should have full information about the HW model.
The compiler should be local to the HW model.
The distributable binary should be just a simple recoding of the new HLL, without any optimizations.
The compiler will use some dynamic information from execution to be able to tune optimization dynamically.
• The structure of the HW elements should be suitable for good optimization control by the compiler (see next slide).
A compiler local to the model removes compatibility requirements from the HW: the local compiler receives the binary, and, if needed for HW improvement, the HW can be changed together with the compiler.
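What "local to the HW model" buys can be sketched as install-time packing: the distributed program is an unoptimized recoding, and the final scheduling into wide instruction words is done on the target machine using its exact description. Here the machine description is reduced to a single hypothetical parameter, the issue width; the tuple format `(dest, src1, src2)` is likewise illustrative:

```python
def pack(ops, width):
    """Greedily pack a dependence-ordered op list into wide words of up
    to `width` slots. A new word starts when the slot limit is reached
    or an op reads a result produced inside the current word (results
    of earlier words are already available)."""
    words, current, written = [], [], set()
    for dest, *srcs in ops:
        depends = any(s in written for s in srcs)
        if current and (len(current) == width or depends):
            words.append(current)
            current, written = [], set()
        current.append((dest, *srcs))
        written.add(dest)
    if current:
        words.append(current)
    return words


# One distributed program, packed differently for two machine models.
ops = [("t1", "a", "b"), ("t2", "c", "d"),
       ("t3", "e", "f"), ("t4", "t1", "t2")]
```

Because packing happens locally, a wider next-generation machine simply re-packs the same distributed form - no binary compatibility constraint leaks into the HW.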
12. HW requirements for BPS
HW in a BPS should not do any optimizations (BPU, prefetching, eviction, etc.) – it cannot do them well enough: it has no algorithm information and cannot do complex reasoning and analysis at run time.
It should do resource allocation according to the compiler's instructions.
The main point here is that the HW structure should avoid "artificial binding" (AB) like the SIP, cache lines, vectors in AVX, full virtual pages, etc.
The data structure in HW should not contradict that of the algorithm.
The data in HW should be like a Lego set, allowing the compiler to do restructuring for optimization.
The BPS should use an Elbrus-like object-oriented memory structure.
13. CONSTRAINED BPS
All past architectures reach an un-improvable state for their constraints. This is
true for the current SS as well.
Therefore, at least a relaxation of the current constraints, while retaining user-level
ISA compatibility (x86, ARM, etc.), is an absolutely necessary condition to step
forward and build a constrained BPS.
We cannot change the semantics of a current ISA. The only possibility is to change
the binary presentation by means of BT.
So, the only possible step forward for constrained computer architecture is
the use of a BT system.
With BT, a constrained BPS will use all mechanisms of the unconstrained BPS,
adding three more mechanisms to support basic compatibility
requirements (SIP, LM). These mechanisms are:
• Retirement
• Check Point
• Memory Lock Table
Unfortunately, for semantic compatibility reasons a constrained BPS cannot
support security and aggressive procedure-level parallelization.
15. In a constrained architecture, the functionality (semantics) of all its elements (data and
operations) is strongly determined by compatibility requirements.
In this section we present the main functional features of an unconstrained
computer system and its elements, developed in accordance with the
approach described above.
The implementation of the mechanisms good for both constrained and unconstrained
systems will be the subject of the following sections.
Primitive data types and operations
Besides the traditional ones (integer, FP, etc.), they include
Data and Functional Descriptors – DD and FD – references to objects and procedures.
DYNAMIC PRIMITIVE DATA TYPES
For primitive data, HW supports data types together with values dynamically (with
TAGs).
TYPE SAFETY APPROACH
All primitive operations check the types of their arguments.
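The tagged-word idea above can be illustrated with a minimal sketch. This is not the Elbrus encoding: the tag names, the `Word` class, and the `add` operation are hypothetical stand-ins for hardware that carries a type tag in extra bits of every word and checks it on every primitive operation.

```python
from dataclasses import dataclass

# Hypothetical tag values; real tagged HW encodes these in extra bits per word.
INT, FLOAT, DD, FD = "int", "float", "data_descriptor", "func_descriptor"

@dataclass
class Word:
    """A machine word carrying its type tag alongside its value."""
    tag: str
    value: object

def add(a: Word, b: Word) -> Word:
    """A primitive operation that checks its argument tags before executing."""
    if a.tag != b.tag or a.tag not in (INT, FLOAT):
        raise TypeError(f"add: incompatible tags {a.tag}, {b.tag}")
    return Word(a.tag, a.value + b.value)
```

The key property is that a descriptor (DD or FD) can never be fed into an arithmetic operation: the tag check rejects it, which is what makes descriptors unforgeable.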
16. User defined data types (objects) functionality
"Natural" requirements for the mechanism of user defined data types
(objects) and their implementation:
1) Every procedure can generate a new data object and receive a reference (DD) to this new object
2) This procedure, using the received reference, can do anything possible with this new object:
– Read data from this object
– Read it as a constant only
– Update any element
– Delete this object
3) No other procedure can access this object just after it has been generated, but this procedure
can give a reference to this object to any procedure it knows (has a reference to), with all or a
reduced subset of the rights listed above
4) Any procedure can generate a copy of a reference to any object it knows, possibly with decreased
rights
5) After the object has been deleted, nobody can access it (all existing references are obsolete)
This "natural" description quite uniquely identifies a rather simple HW implementation with very high
overall execution efficiency (compared with traditional systems).
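Requirements 1)-5) above can be sketched as a small capability model. All names here (`ObjectStore`, `DD`, the right bits) are illustrative, not the Elbrus implementation; the point is that rights only ever decrease when a reference is copied, and deletion makes every remaining reference obsolete.

```python
from dataclasses import dataclass

# Hypothetical right bits carried inside a data descriptor (DD).
READ, UPDATE, DELETE = 1, 2, 4

class ObjectStore:
    """Models the object space; deleting an object obsoletes every remaining DD."""
    def __init__(self):
        self._objects, self._next_id = {}, 0
    def new_object(self, size):
        oid, self._next_id = self._next_id, self._next_id + 1
        self._objects[oid] = [0] * size
        return DD(self, oid, READ | UPDATE | DELETE)  # creator gets full rights

@dataclass
class DD:
    store: ObjectStore
    oid: int
    rights: int
    def read(self, i):
        assert self.rights & READ, "no read right"
        return self.store._objects[self.oid][i]    # KeyError once object deleted
    def update(self, i, v):
        assert self.rights & UPDATE, "no update right"
        self.store._objects[self.oid][i] = v
    def delete(self):
        assert self.rights & DELETE, "no delete right"
        del self.store._objects[self.oid]          # all existing DDs become obsolete
    def copy(self, rights):
        assert rights & ~self.rights == 0, "rights may only decrease"
        return DD(self.store, self.oid, rights)
```

A procedure can hand a read-only copy (`d.copy(READ)`) to another procedure; no sequence of operations lets the receiver regain update or delete rights.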
17. User defined data types (cont.)
An object can have a user defined Object Type Name (OTN). The OTN is also primitive
data, assigned to the object by its creator.
Primitive HW operations check the types of their arguments.
A procedure can also check the type of any object it is working with.
The compaction algorithm – an efficient solution of the dangling pointer problem
(compared with the less efficient Garbage Collection, GC) – was developed in the Elbrus
computer. It should be used in an unconstrained BPS.
With this approach, the user (similarly to existing systems) explicitly kills an object
no longer in use, which (unlike GC) immediately frees physical (but,
unfortunately, not virtual) memory.
When virtual memory is close to overflow, a background compaction algorithm
scans the whole memory sequentially, deleting the DDs of killed objects and
decrementing the virtual addresses of still-alive objects, which results in
compacted virtual memory and the possibility to reuse all virtual memory freed
from killed objects.
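The compaction pass can be sketched as follows. This is a simplified model, not the Elbrus algorithm: objects are represented by a hypothetical `Obj` record holding a virtual base address, and the pass slides live objects down while dropping killed ones.

```python
class Obj:
    """Hypothetical object record: virtual base address, size, liveness flag."""
    def __init__(self, base, size):
        self.base, self.size, self.alive = base, size, True

def compact(objects):
    """Sequentially scan virtual memory: drop killed objects, slide live ones down."""
    next_base, survivors = 0, []
    for o in sorted(objects, key=lambda o: o.base):
        if not o.alive:
            continue             # the descriptor of a killed object is deleted here
        o.base = next_base       # decrement the virtual address of a live object
        next_base += o.size
        survivors.append(o)
    return survivors, next_base  # compacted set and first free virtual address
```

After the pass, all virtual space formerly occupied by killed objects is contiguous at the top and can be reused, which is exactly the property the slide describes.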
18. Procedure mechanism (user defined operations)
Here too we discuss the first "natural" requirements for procedure construction, to
support language-level functionality consistent with the "abstract algorithm" ideas.
1) Any procedure can define another procedure, and make any information accessible to the
original procedure global data of the new procedure. In a real running program, the only
thing needed to define the new procedure is to generate (there is a special instruction in the ISA)
a Functional Descriptor (FD), which allows calling this new procedure.
2) The procedure that generated this FD can give it to anybody it has access to, and this
new owner can also call the new procedure (only call it, without access to its global data,
executable code, etc., which can be used by the called procedure only).
3) The procedure that generates the FD includes in it the virtual address of the code to be executed
when the new procedure is called, and also the virtual address of the global data object,
which can be used by the instructions of the new procedure. Therefore both references
are included in the FD (a reference to code and a reference to global data).
4) Any procedure that has the FD of the new procedure can call it and pass it
some parameters. Parameter passing is logically an atomic step – the new procedure does
not start (not a single instruction of the callee is executed) before the caller specifies all
parameters, and the caller has no access to the parameters passed to the callee after the call
is executed.
5) The caller can receive some return data as a result of procedure execution. These data can be
used by the caller's code. Here too, return value passing is atomic.
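The FD described above – a pair of (code reference, global-data reference) – behaves much like a closure. A minimal sketch, with all names (`FD`, `make_counter`) hypothetical:

```python
class FD:
    """Functional Descriptor: pairs a code reference with a global-data reference.
    A holder can only call through it; it never sees the code or the globals."""
    def __init__(self, code, globals_ref):
        self._code = code              # reference to executable code
        self._globals = globals_ref    # reference to the global data object
    def call(self, *params):
        # Parameter passing is atomic: the callee starts only with all params given.
        return self._code(self._globals, *params)

def make_counter():
    state = {"n": 0}                   # data accessible to the defining procedure
    def body(globals_, step):
        globals_["n"] += step
        return globals_["n"]
    return FD(body, state)             # any holder of this FD can call, nothing more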
19. Procedure mechanism (user defined operations) (cont.)
An extremely important notion for a procedure is the procedure context – the
only set of data the called procedure can use. The called procedure can
use nothing besides its context.
The procedure context includes:
• Global data given to the procedure by its creator procedure
• Parameter data from the caller
• All data returned to the procedure by the procedures it calls
The restriction of a procedure to context-only access is the result of HW architecture
features:
• Dynamic data types and type safety of primitive operations
• Strong support for the semantics of references (DD and FD)
This is the foundation of capability technology, which ensures strong inter-procedure
protection.
Implementing all these features in HW is a rather simple and efficient job.
20. Full solution of the security problem
Strong inter-procedure protection ensures that no attacker can corrupt the
functioning of the system SW (if it has no internal mistakes) or the HW model.
An attacker cannot access any system data, as a result of the capability feature,
simply because the attacker will never have any references (DD or FD) to system data.
Nobody can send them to him, and he is unable to "create" them artificially.
He is also unable to do anything bad without real references to system SW.
However, many security problems today result from the attacker's ability to exploit
mistakes in the user programs he works with.
Logically, the only remedy here is the possibility to use the well-developed technology of
program correctness proof.
However, with today's architectures (x86, ARM, etc.) even a procedure without any
mistakes can be corrupted by an attacker, due to the imperfection of the old architecture.
This is not the case with a capability system, where a correctness proof gives a reliable
result.
The presented approach fully solves the security problem.
This technology was fully implemented in the Elbrus computer about 40 years ago.
Unfortunately, nobody is even close to this solution yet.
22. Object oriented memory (OOM)
OOM was designed and used in two generations of the Elbrus
computer with good results. Unfortunately, at that time there
was no requirement for a cache, but it can now easily be
extended to caches. The current Narch design was made on a
traditional memory and cache structure; however, that
memory structure does not correspond to the above philosophy.
The OOM design can be used in full in an unconstrained BPS.
Unfortunately, it cannot be used for the memory system of a
constrained BPS (Narch) for compatibility reasons.
However, it can be used in its cache system.
According to preliminary estimations, the OOM structure, even
for a constrained BPS, can decrease cache sizes by up to 2-3
times and nearly eliminate performance losses due to cache
misses.
23. Object oriented memory (OOM) implementation
The organization of physical memory and of all cache levels is, in general, the same. The
following description applies to all of them.
The size of the physical memory allocated for an object is equal to the object size. However,
each allocated object is also mapped into the virtual space, which has fixed-size pages. For
each new object, virtual space is allocated from the beginning of a new page. If the size of the
object is smaller than the page size, the end of the virtual space of this page is empty
(unused). If the object is bigger than the virtual page, a number of pages are allocated
for it and the last one can be partly unused.
One of the main results of this organization is that each page can include data of one object
only. A page can never include data of more than one object. All free space is explicitly
visible to the HW and the compiler (no "artificial binding").
In memory, as well as in caches, an arbitrary physical part of the object can be allocated (by
the compiler local to the model) in a specific cache.
All physical space (of variable size), both in memory and at any cache level, is allocated
dynamically. Therefore the whole free space is highly fragmented, and it is very
difficult, sometimes impossible, to allocate a rather big piece of an object.
We split objects into pages to cope with this problem.
However, at the cache level even the page size is big from this viewpoint.
Therefore, the parts of an object allocated at cache levels are split by the compiler local
to the model into even smaller parts (all of them parts of the same virtual page).
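The page rule above – every object starts on a fresh virtual page, so no page ever holds two objects – can be sketched as a tiny allocator. The page size and class names are assumptions for illustration:

```python
PAGE = 4096  # assumed virtual page size

def pages_for(obj_size):
    """Each object starts on a fresh page; the last page may be partly unused."""
    return -(-obj_size // PAGE)    # ceiling division

class VirtualSpace:
    """Allocates whole pages per object, so a page never holds two objects."""
    def __init__(self):
        self.next_page = 0
        self.owner = {}             # page number -> object id
    def allocate(self, oid, size):
        first = self.next_page
        self.next_page = first + pages_for(size)
        for p in range(first, self.next_page):
            self.owner[p] = oid     # pages are exclusive to this object
        return first * PAGE         # virtual base address of the object
```

Because page ownership is exclusive, any unused tail of a page is explicitly visible free space rather than hidden "artificial binding" with a neighboring object.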
24. Object oriented memory (OOM) implementation (cont.)
The system supports special lists for all free spaces. Each list keeps the free
areas of a certain set of sizes (most likely, powers of 2).
Each free area is linked into one of the bidirectional lists through the first word
of the free piece.
Actually, OOM uses virtual object numbers instead of virtual memory
addresses. Therefore, in the case of an object spanning many pages, all its
pages have the same virtual object number. The full identification of a specific
element of an object includes the virtual object number and the element's index inside
the object. The descriptor, however, includes the virtual object number only.
In OOM, an object need not necessarily be present in memory. Some
objects can be generated, for example, in the Level 1 cache only, or at other
cache levels.
25. Object oriented memory (OOM) implementation (cont.)
This memory/cache organization allows stronger compiler control over
execution.
The compiler knows all the semantic information of the program and can do more
sophisticated optimization.
The compiler can preload the needed data to a high cache level, at first
without assigning the more valuable register memory, moving the data
from cache to a register only at the last moment. But now even preloading
directly into a register can sometimes be a good alternative – we now have a
big register file.
This cache organization allows accessing the first-level cache directly
from an instruction by physical address, without using a virtual address and
associative search.
To do this, the base register (BR) can support a special mode in which it includes
pointers to the physical location in the first-level cache together with the
virtual address.
26. Procedure mechanism (implementation)
In the past we used the "strands" approach for this implementation. While the "strand"
approach is substantially better than superscalar, it still allows dramatic
improvement.
In the strand implementation, each strand is a HW resource. The parallelism level of a
dynamically executed program varies depending on the dynamic resource
situation; therefore, execution should be able to dynamically fork a new strand,
which requires a new resource.
Typically, for such a situation a deadlock avoidance problem has to be solved. A static
solution of this problem decreases performance. This is less dangerous for loops,
because loops can be executed nearly without stopping and forking strands.
However, it is not so good for scalar code.
Here we discuss a substantially more advanced suggestion, which is good for
scalar code and increases performance for loops as well. It can be used in a constrained
BPS (Narch) too.
This will improve the already declared performance data for Narch.
27. Procedure mechanism (implementation) (cont.)
In the new approach, the code to be executed is presented as a fine-grained parallel
graph with instructions in its nodes and dependencies presented by the arcs of the
graph.
The compiler splits this graph into a number of streams, similar to the strands in the
current implementation.
Instead of the frontend of the current design, the new approach has only a code buffer
for the whole graph (not for separate streams).
Four basic technologies are used here:
• Register allocation, which is not so trivial in the case of fine-grained dynamic
code execution
• Speculative execution (control and data speculation) – same as today in Narch
• Dynamic execution of the parallel instruction graph by "workers"
• Loading of the instruction graph into the instruction buffer
28. Register allocation
DL/CL technology
The scalar code (stream) graph can be crossed by both DL and CL lines. Code can have
several DLs and CLs, each with a corresponding number – DLn and CLn.
All instructions that cross a DL or CL include this information, and the HW knows when a
specific line has been crossed.
Crossing some DLn means that some register WEBs are already free (all
reads and writes are finished) and can be reused. The registers freed with
DLn can be used by the compiler in instructions after the corresponding CLn. Therefore,
the corresponding CLn can also be crossed by the corresponding streams.
If some instruction marked by CLn is ready to execute but the corresponding DLn has not been
crossed yet, this instruction will wait until that happens.
The program will execute correctly either way, but the execution time can be improved.
Dynamic feedback (in HW) collects information on whether any CL was waiting, and using
this information the compiler can later recompile the procedure, lifting the corresponding DL a
little. Eventually, the program will work without any time losses for CL waits.
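The DL/CL synchronization can be sketched as a toy scheduler. This is a model, not the HW: streams are lists of markers and ops, crossing DLn records that its registers are free, and an instruction at CLn stalls until the matching DLn has been crossed.

```python
def run(streams):
    """streams: lists of ('DL', n) / ('CL', n) / ('op', name) entries.
    A stream blocks at ('CL', n) until some stream has crossed ('DL', n)."""
    crossed, trace = set(), []
    pcs = [0] * len(streams)
    progress = True
    while progress:
        progress = False
        for i, s in enumerate(streams):
            while pcs[i] < len(s):
                kind, arg = s[pcs[i]]
                if kind == 'CL' and arg not in crossed:
                    break                # wait: the matching DL not crossed yet
                if kind == 'DL':
                    crossed.add(arg)     # registers freed with DLn become reusable
                elif kind == 'op':
                    trace.append(arg)
                pcs[i] += 1
                progress = True
    return trace
```

In the sketch a stream blocked at CL1 resumes only after another stream crosses DL1 – the same ordering the hardware enforces on register reuse.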
29. Speculative execution (control and data)
Branch execution in the new approach is similar to the previous one.
The BT compiler in the constrained version, and the high-level language compiler in the
unconstrained version, generate a fine-grained parallel binary for the HW.
In a superscalar with BPU technology, all branches are critical and each needs
predicted speculative execution, with performance losses in case of misprediction.
In our case, due to explicitly parallel execution, according to our statistics
80% of branches are not critical and can be executed without speculation.
Even for critical branches, when the predicate is known well ahead,
or has a very strong compiler prediction, there is no need for speculation.
Critical branches with a late predicate and a bad compiler prediction should
execute both alternatives speculatively until the predicate is known.
As a result, in our case we have no performance losses for branches at all.
The situation with data speculation is similar.
30. Dynamic execution of the parallel instruction graph by "workers"
For the constrained architecture, the compiler will do all decoding itself instead of
the HW; therefore, each instruction in the code is ready to be loaded into the
corresponding execution unit. In the unconstrained case, instructions also
need no decoding.
For each instruction, the compiler will calculate a "Priority Value Number" (PVN).
This number is the number of clocks from this instruction to the end of the
scalar code along the longest path. The compiler will present the code as a
number of dependent instruction sequences – "streams" (similar to the strands in
the previous design).
In this architecture, from the very beginning the processor will execute not a
"single instruction pointer" sequential code, but the whole graph of the
algorithm – all streams, with the explicitly parallel structure visible to the HW.
To make this possible the processor, besides the register file, includes the code buffer.
The new technology removes the frontend from the HW entirely. There are many
other advantages to this step as well.
As the code will be executed in fine-grained parallel mode, each register should
have an EMPTY/FULL (E/F) bit to prevent reading from an empty register and make
the reading instruction wait until the result is assigned.
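The PVN defined above – clocks from an instruction to the end of the scalar code along the longest path – is a longest-path computation over the dependence DAG. A sketch (node names and latencies are illustrative; the real compiler works on its own graph representation):

```python
def pvn(latency, succs):
    """latency: node -> clocks; succs: node -> dependent nodes.
    PVN(n) = latency(n) + max PVN over successors (0 at the graph's end)."""
    memo = {}
    def longest(n):
        if n not in memo:
            memo[n] = latency[n] + max((longest(s) for s in succs[n]), default=0)
        return memo[n]
    return {n: longest(n) for n in latency}
```

A worker picking instructions by the highest PVN first is effectively scheduling the critical path first, which is why the compiler precomputes this number.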
31. Dynamic execution of the parallel instruction graph by "workers" (cont.)
Our engine has a number of "workers" in each cluster, whose job is to take the
next instructions from the most important streams and allocate them to the
corresponding execution units.
The number of workers in each cluster should be enough to keep all execution
units busy every clock.
Our preliminary guess is that each cluster should have about 16 workers.
A worker loads into the Reservation Station (RS) a candidate instruction that is ready to be
executed: all argument registers are FULL (or the instruction that should generate a
value has already been sent to the RS – which needs yet another bit (RS) in each register)
and the destination is EMPTY.
Besides the E/F and RS bits, each register has (one byte for) the head of the queue of
streams waiting for the result to be written into this register from some
other stream.
If at least one argument of the next instruction to be allocated is not ready, the
worker stops working with this stream and puts it into the unidirectional
queue of one of the not-ready registers. This work requires two register
assignments, which can be done in parallel; in any case the worker is then free
and searches for any other stream ready to be handled.
32. Loading of the instruction graph into the instruction buffer
DL/CL technology helps to solve the big-code problem.
The code buffer needs an extension mechanism: when the code before
CLn is being executed, it is necessary to load the next part of the code,
between CLn and CLn+k. Similarly, when DLn is crossed, the whole code area
above it can be freed.
The size of the code between CLn and CLn+k is not bigger than the size of the
register file.
33. Example: Structure of Recurrent Loop Dependencies
Use loop iteration parallelism (both iteration-internal and
inter-iteration) as fully as possible.
Loop iteration analysis performed by the compiler:
– Find instructions that are self-dependent over iterations
– Find the groups of instructions that, being self-dependent,
are also mutually dependent over the iterations ("rings" of data
dependency)
– The rest of the instructions form sequences, or a graph, of
dependent instructions (a number of "rows")
– The result of each row is either an output of the iteration
(a STORE, for example), or is used by other row(s) or ring(s)
Each "ring" and/or "row" loop produces data consumed by
other small loops. Each producer can have a number of
consumers. However, producer and consumer should be
connected through a buffer, giving the producer the possibility
to run ahead if the consumer is not yet ready to use the data.
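The ring/row decomposition with a decoupling buffer can be sketched as follows. The buffer size, the recurrence, and the class names are all illustrative; the point is that the producer (a "ring": x[i] = x[i-1] + 1, self-dependent over iterations) stalls only when the buffer is full, not whenever the consumer (a "row") lags.

```python
from collections import deque

class ChannelBuffer:
    """Bounded FIFO decoupling a producer loop from a consumer loop."""
    def __init__(self, capacity):
        self.q, self.capacity = deque(), capacity
    def put(self, v):
        if len(self.q) >= self.capacity:
            return False               # producer stalls only on a full buffer
        self.q.append(v)
        return True
    def get(self):
        return self.q.popleft() if self.q else None

def run_loop(n, cap=4):
    """Ring produces x[i] = x[i-1] + 1; row consumes and doubles each value."""
    buf, x, out = ChannelBuffer(cap), 0, []
    produced = consumed = 0
    while consumed < n:
        if produced < n and buf.put(x + 1):
            x += 1
            produced += 1              # ring runs ahead up to buffer capacity
        v = buf.get()
        if v is not None:
            out.append(v * 2)          # row: consumes the ring's output
            consumed += 1
    return out
```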
34. Basic components of computer technology, their current state,
and our involvement in their implementation (block diagram):
1. Primitive data types and operations introduction
2.1 User defined data types functionality – Objects introduction
2.2 User defined operations functionality – Procedure introduction
2.2.1 Intra & inter proc parallelism
3. "New" HLL introduction
3.0.1 Parallelism
4.1 User structural data architecture support – object oriented memory implementation
4.1.1 To be extended to cache
4.2 User "operations" (procedure) implementation
4.2.1 Intra (fine grained) & inter procedure execution parallelism architecture implementation
5. "New" OS kernel introduction
35. Green parts of computer technology were fully
implemented by our (ELBRUS) team in a real design
(1978), before anybody else in the industry.
Yellow parts require moderate extensions of some
of the green technologies to support fine-grained
parallelism.
The red part is the introduction of intra (fine grained) &
inter procedure parallelism. All basic decisions are
well developed and need to be implemented in a real
design.
36. The block diagram above includes all the basic parts of computer
technology and indicates their current states.
1. Introduction of primitive data types and operations
The implementation of arithmetic is highlighted in green – over 60 years ago this
implementation reached the un-improvable state.
Carry save algorithm – my student work in 1954, university presentation in
1955. The first western publication was in 1956.
High-radix arithmetic – James E. Robertson, mid 50s. I had a meeting with him in
Moscow in 1958.
2.x Introduction of the functionality of user defined data types (Objects) & operations
(Procedures)
This functionality must be defined with one main, and maybe only, basic goal:
• To fully correspond to the natural meaning of these notions, without corruption
by trying to pursue optimization, security, or other goals.
• If this job is not constrained by any compatibility requirements (especially with
early-day architectures), this approach ensures the best possible byproduct
results for all these goals.
This problem was fully solved in the Elbrus architecture (1978) and showed
outstanding results in two generations of computers widely used in our country.
Though it is difficult to prove theoretically, it is rather evident that
this approach is the best possible, simply because the above goal (the natural meaning
of the functional elements) has only one solution.
37. 2.2.1 Intra & inter proc parallelism
Procedure definitions should be extended with intra (fine grained) & inter procedure
parallel execution semantics.
It was not possible to implement this in Elbrus times, because the HW was unable to
support it.
This is part of the work to be done on the parallel architecture implementation.
All basic approaches have already been suggested in our team.
3. "New" HLL introduction
3.0.1 HLL parallelism extension
We already implemented a new language for such a design in Elbrus (EL-76).
According to the declared general design principle, this language should be (and is)
a language with dynamic data types and the type safety approach.
It should be extended with parallel semantics.
4.1 Object oriented memory implementation
Unlike superscalar memory and cache organization, object oriented memory allows
the compiler local to the model to do efficient optimization.
Object oriented memory is fully implemented in Elbrus.
38. 4.1.1 To be extended to cache
In Elbrus times there was no need to use caches.
All suggestions in this area have already been made.
4.2 Procedure implementation
For an advanced architecture the procedure is a highly important feature.
Elbrus made a very clean functional implementation of the procedure. The basic result is highly modular
programming with strong inter-procedure protection.
This is also a clean and best possible implementation.
The main design step to be done here is its extension for intra & inter procedure parallelism support.
4.2.1 Intra (fine grained) & inter procedure execution parallelism implementation
These are the main design efforts for finishing the design of the best possible architecture.
Only about 10 years of silicon technology progress were required to make a radically
parallel architecture possible to implement.
Our team has reached this point with big past experience in this area:
• The industry-first real OOO superscalar (Elbrus 1, 2) – 1978
• Even more important, we found out that it is not the best approach and got rid of it after the
second generation (Elbrus 2) – 1985
• VLIW (Elbrus 3) with the first successful cluster – ~2000
• Strands (already in Intel) – 2007-2013
• Clean loop implementation based on strands – 2007-2013
All these approaches, while reaching good results, are not the best possible (including strands).
Now we have suggested a radical improvement, close to Data Flow, both for scalar code and for loops
(also, it seems, for the first time in the industry).
39. 5. "New" OS kernel introduction
Elbrus 1, 2 are the first and best possible full implementations of this
technology. Due to its basic principles, Elbrus did not need to use privileged-mode
programming even in the OS kernel.
An OS kernel implementation with the same functionality is about four
times simpler (smaller in size) compared with today's OSs and can be
implemented in application mode only.
40. Results
• Elbrus, Narch and Narch+ are built strictly according to the approach
presented in this paper. The results are impressive. These are the results
of the work and application of the widely used Elbrus 1, 2, 3 architecture and
of detailed simulation of the future design.
• This approach allows the implementation of an architecture unconstrained by
any compatibility restriction (Narch+), or compatible with one of the existing
architectures – x86, ARM, POWER, etc. – or even with all of them together
in one HW model with BT (Narch).
41. Main results over the most powerful Intel processors:
Narch
• Extremely high performance in both single-job (ST) and MT applications –
unreachable for any existing architecture, and maybe reaching an absolute
un-improvable level:
– Already shown by detailed simulation,
before introduction of all performance mechanisms:
2x+ on ST
2x on MT with the same area
– After finishing debugging:
3x-4x on ST
2.5x-3x on MT with the same area
• Substantially more power efficient and less area with the same performance:
20%-30% power efficiency
60% area
• Simpler architecture design
• Un-improvable for any current architecture, fully compatible with x86 or ARM
or any other current architecture
42. Main results over the most powerful Intel processors:
Narch+
• Performance is many tens of times higher for both ST and MT
• Extremely simple and power efficient
• Substantially simpler and more reliable SW debugging (according to Elbrus
experience – by 10 times)
• Full solution of the security problem for HW, OS, and user programs
(with correctness proof) – all attackers will be jobless
• Really universal, which is a rather important feature. No architecture since
the very first vacuum tube computer has had this characteristic.
It is very likely that after the introduction of Narch+ (if this happens), it will no longer be
necessary to design a myriad of specialized architectures for graphics, computer
vision, machine learning and so on.
Narch+ will be an absolutely un-improvable architecture nearly from the very first
design.