Sync on TAP - Syncing infrastructure with software by ADVA
At ITSF 2022, Ken Hann discussed our new OSA SoftSync™ client software, which adds advanced PTP timing technology to server-hosted applications, and the OSA 5400 M.2, which brings precise timing to COTS servers or even virtual machines. The session outlined how, for a low total cost of ownership, OSA SoftSync™ creates an end-to-end synchronization architecture able to meet the needs of advanced time-critical services. This will be invaluable for 5G, financial institutions, data center networks and other time-critical applications.
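As background to the PTP timing mentioned above, a PTP client estimates its clock offset from four timestamps exchanged with the master. The following is a minimal sketch of that standard calculation, assuming a symmetric network path; the function name and sample values are illustrative and are not taken from OSA SoftSync™.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Estimate client clock offset and one-way path delay.

    t1: master sends Sync            (master clock)
    t2: client receives Sync         (client clock)
    t3: client sends Delay_Req       (client clock)
    t4: master receives Delay_Req    (master clock)
    Assumes the path delay is the same in both directions.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2  # client clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2   # one-way path delay
    return offset, delay


# Hypothetical example: client clock runs 5 units ahead, path delay is 10 units.
offset, delay = ptp_offset_and_delay(t1=100, t2=115, t3=120, t4=125)
print(offset, delay)
```

Any asymmetry between the two directions shows up directly as an error in the offset estimate, which is why timestamping close to the network interface matters.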
Technical Overview of Cisco Catalyst 9200 Series Switches by Robb Boyd
TechWiseTV's Cisco Container Platform live workshop took place on July 18th.
For the first time in the industry, a single family of fixed, stackable, and modular switches runs on the same IOS-XE operating system with a common ASIC.
Cisco’s Catalyst 9200 rounds out the lower end of its incredible Catalyst 9000 family of switches. The 9200 is designed for small, medium, and branch deployments, providing greater modularity, redundancy, and stackability than the Catalyst 2960 it replaces.
Register now.
Beginners: 5G Terminology (Updated - Feb 2019) by 3G4G
An updated short presentation and video looking at 5G terminology that is being used in 3GPP standards and specifications.
Terms such as NG-RAN, NR, ng-eNB, en-gNB, RIT, SRIT and Option 3 are discussed.
Beginners: Energy Consumption in Mobile Networks - RAN Power Saving Schemes by 3G4G
This tutorial looks at energy consumption in mobile networks, especially 4G and 5G, and at the various ways in which vendors and standards bodies are working to reduce power consumption.
At a high level, there are three layers of optimisation: Network level, Site level and Equipment level. This presentation looks at some of the ways the optimisation is achieved.
There is a long list of references available for anyone interested in researching this topic further.
All our #3G4G5G slides and videos are available at:
Videos: https://www.youtube.com/3G4G5G
Slides: https://www.slideshare.net/3G4GLtd
5G Page: https://www.3g4g.co.uk/5G/
Free Training Videos: https://www.3g4g.co.uk/Training/
LTE is a common standard covering both FDD and TDD flavors, enabling the industry to build common FDD/TDD infrastructure, common devices, and a large common ecosystem. LTE and its evolution LTE Advanced play a critical role in addressing the 1000x increase in mobile data.
Qualcomm has been leading LTE proliferation from the very beginning: from the industry-first Gobi LTE/3G multimode, common FDD/TDD modems to the current third-generation solutions that powered the world’s first LTE Advanced carrier-aggregation launch in June 2013.
For more information please visit www.qualcomm.com/lte
Download the presentation here: http://www.qualcomm.com/media/documents/lte-qualcomm-leading-global-success
This updated presentation/video looks at the 5G Network Architecture options that have been proposed by 3GPP for deployment of 5G. It covers the Standalone (SA) and Non-Standalone (NSA) architectures. Within the NSA architecture, it covers EN-DC (E-UTRA-NR Dual Connectivity), NGEN-DC (NG-RAN E-UTRA-NR Dual Connectivity) and NE-DC (NR-E-UTRA Dual Connectivity). Finally, it discusses the migration strategies proposed by vendors and operators (MNOs / SPs).
5G Network Architecture, Design and Optimisation by 3G4G
Presented by Prof. Andy Sutton, Principal Network Architect, Architecture & Strategy, TSO, BT at The IET '5G - State of Play' conference on 24th January 2018
*** SHARED WITH PERMISSION ***
A quick look at the 5G System architecture in Reference point representation and in Service Based representation, and at the different Network Functions (NFs) within the 5G System.
6G: Potential Use Cases and Enabling Technologies by 3G4G
This white paper presents an overview of some of the promising applications and use cases envisioned for 6G, with the objective of highlighting the potential for new markets and indicating the expected technical requirements. The white paper then describes some of the enabling technologies for meeting the performance requirements of 6G.
Authors: Ritvik Gupta, Student (A-Levels), Sutton Grammar School, London, United Kingdom, under the supervision of Dr Biplab Sidkar, Associate Professor, Dept of Electrical and Computer Engineering, National University of Singapore
Comparison of SRv6 Extensions uSID, SRv6+, C-SRH by Kentaro Ebisawa
Comparing concept, SID and header format of compressed Segment Routing IPv6 proposals such as uSID, SRv6+, C-SRH. Slide presented at SRv6 Consortium @Tokyo on 23rd Aug 2019.
Call Home is an innovative system that enables network functions virtualization (NFV) and software-defined networking (SDN) in situations where virtual customer premises equipment (vCPE) is protected by a cable modem or a firewall. In this presentation, ADVA Optical Networking’s engineers outlined this new technique and explained how it makes secure NFV/SDN deployment possible when a NETCONF client is otherwise unable to initiate an SSH connection directly to the NETCONF server.
Rajendra Nagabhushan and Vikram Darsi discussed how the IETF draft for NETCONF Call Home Using SSH can be implemented as an OpenDaylight feature. They demonstrated how the technology can be applied in a real-world use-case and outlined how an ADVA Optical Networking product is being developed ahead of its 2017 release.
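The core idea of Call Home described above is a reversal of who opens the TCP connection: the device behind the firewall dials out, and the management side then starts the session over that connection. The sketch below illustrates only this role reversal with plain TCP sockets on localhost. It is not the OpenDaylight feature or the ADVA product; the function name and the placeholder messages (`SESSION-START`, `OK`) are hypothetical stand-ins for the SSH and NETCONF exchange.

```python
import socket
import threading


def run_call_home_demo():
    # Management side (NETCONF client role) listens for inbound call-home connections.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    listener.settimeout(5)
    port = listener.getsockname()[1]
    device_saw = []

    def device():
        # Device (NETCONF server role) dials OUT, which a firewall or
        # cable modem typically permits, then waits for the session to begin.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
            device_saw.append(conn.recv(1024))
            conn.sendall(b"OK")  # placeholder for the device's reply

    worker = threading.Thread(target=device)
    worker.start()

    conn, _addr = listener.accept()
    with conn:
        conn.settimeout(5)
        # The management side speaks first over the device-initiated connection
        # (placeholder for the SSH client handshake).
        conn.sendall(b"SESSION-START")
        reply = conn.recv(1024)

    worker.join()
    listener.close()
    return device_saw[0], reply


print(run_call_home_demo())
```

In a real deployment the placeholder bytes would be replaced by an SSH session in which the listening side acts as the SSH client, with NETCONF running on top of it.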
Description and comparison of 3G, 4G and 5G Core Networks. You can find my detailed report at https://medium.com/@sarpkoksal/core-network-evolution-3g-vs-4g-vs-5g-7738267503c7
Ericsson’s proprietary Lean Carrier innovation is first to address intercell signaling interference, introducing lean design concepts to 4G LTE to improve data speed and app coverage for users while on the road to 5G.
Description: In this presentation / video we look at the argument why 5G NSA Option 3 (EN-DC) may not be around for a very long time and may be turned off by the operators when 5G Core is up and running and all the initial issues have been sorted out.
Video link: https://youtu.be/ERnq8WLlse0
Introduction to PS Core Network elements and a little of the EPC/LTE network. This is the introductory slide pack for a 10 class/slides set that gives a detailed introduction to the 2G/3G and LTE PS Core Network.
Last update: Feb 7, 2021
As 5G broadband began to be promoted throughout the United States, it brought users not only a faster Internet but also a new technical architecture designed to further support 5G networks.
As operators around the world are looking for solutions to cope with the growing demand for mobile data, it is necessary to develop 5G technology.
One of those architectures is device-to-device (D2D) communication, which refers to direct communication between devices such as cellphones or vehicles. This system opens up device-centric communication that often requires no direct communication with the network infrastructure.
This matters because D2D architecture is predicted to solve at least part of the network capacity issue, as 5G promises to connect more devices over faster, more reliable networks.
To understand the new 5G technology, the important point is that it does not only involve faster smartphones. In fact, technologists now call 5G the post-smartphone era.
Higher speeds and lower latency will enable new experiences that require continuous communication without lag, such as augmented reality and virtual reality, connected cars, smart homes, and machines.
Tonex provided 5G Network Architecture, Planning and Design
Tonex training introduced 5G technology, architecture and protocols. It also discussed 5G air interface and core network technologies and solutions. The course includes investigations of traffic cases and solutions, deployments and products, and covers 3GPP and IMT-2020 methods.
Learning Targets:
Explain the key 5G Principles, Services and Technical aspects
Explain the aim of implementing 5G within the existing mobile ecosystem
Describe a number of the 5G Use Cases and Applications: 3GPP and ITU 5G Use Cases (eMBB, URLLC and mMTC)
List 5G Network Features including: functions, nodes and elements, interfaces, reference points, basic operational procedures and architectural choices
Describe the overall 5G specification
Compare and contrast 5G system with traditional LTE, LTE-A and LTE-A Pro systems (3GPP version)
List and explain 5G RAN and core network architecture
Explain 5G access
Describe the 5G system engineering (access network, 5G core) method
Describe the use of NFV/SDN and network slicing in 5G systems
Learn about 5G radio access networks including 5G New Radio (NR)
Audience:
Engineers
Managers
Marketing and operation personnel
Anyone who wants to learn 5G systems including 5G Radio Access Network (RAN), 5G New Radio (NR), 5G core and integration with LTE/LTE-A and LTE-A Pro
Course Outline:
Introduction to 5G Mobile Communication
Key Principles of 5G Systems
5G System Architecture
3GPP 5G System Architecture
5G New Radio (NR)
For More Information:
https://www.tonex.com/5g-training-education-5g-wireless/
Determine the required delivery characteristics of a packet stream and how a Traffic Management (TM) module can offload compute-intensive tasks. Hear more about the latest innovations in both DPI & TM solutions.
Dell PowerEdge zero touch provisioning with Auto Config speeds and simplifies server deployment. Using Server Configuration Profiles and your existing data center infrastructure, deploy one or thousands of PowerEdge servers reliably and repeatably. Learn more: http://www.dell.techcenter.com/LC
Soft x3000 operation manual configuration guide. Part replacement is risky. Before the operation, you must estimate whether the risk can be controlled with technical protection measures without powering off the equipment. If so, you can carry out the replacement; if not, contact the regional office of Huawei immediately for technical support. After taking the corresponding protection measures, you can carry out the replacement operations by following the procedures stipulated in this manual, for example, pulling out a board, inserting a board, or setting the dual-inline package (DIP) switches of a board.
To ensure safety, only trained professionals may replace parts that can be replaced only when the back door of the cabinet is open, because the power distribution box and service frames have -48 V power terminals.
Notes on Equipment Security
Do not pull out two or more UPWRs in one frame simultaneously; otherwise, the running UPWRs in the frame will be overloaded or even burnt.
Never insert UPWRs of different types into the same frame. The UPWR has two types: a and b. If UPWRs of different types are inserted in one frame, the current supplied by the UPWRs is uneven, and the UPWR supplying more current will be overloaded or even burnt.
Complete the replacement of a fan box within five minutes; otherwise, the security and normal operation of the corresponding service frame will be greatly affected.
R, Data Wrangling & Kaggle Data Science Competitions by Krishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/ZRoFHi
This is the R version of my PyCon tutorial plus a few updates.
It is a work in progress. I will update it with daily snapshots until done.
Lessons Learned from Building Machine Learning Software at Netflix by Justin Basilico
Talk from Software Engineering for Machine Learning Workshop (SW4ML) at the Neural Information Processing Systems (NIPS) 2014 conference in Montreal, Canada on 2014-12-13.
Abstract:
Building a real system that incorporates machine learning as a part can be a difficult effort, both in terms of the algorithmic and engineering challenges involved. In this talk I will focus on the engineering side and discuss some of the practical issues we’ve encountered in developing real machine learning systems at Netflix and some of the lessons we’ve learned over time. I will describe our approach for building machine learning systems and how it comes from a desire to balance many different, and sometimes conflicting, requirements such as handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. I will focus on what it takes to put machine learning into a real system that works in a feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. I will address the particular software engineering challenges that we’ve faced in running our algorithms at scale in the cloud. I will also mention some simple design patterns that we’ve found to be useful across a wide variety of machine-learned systems.
Balancing Infrastructure with Optimization and Problem Formulation by Alex D. Gaudio
- How do we currently think about Data Science?
- Why is infrastructure important to our field?
- Two tools we've built on Sailthru's Data Science team to deal with these problems are "Stolos" and "Relay.Mesos".
ABSTRACT: The ongoing big data revolution has revolutionized the way in which technology is used to empower new business segments like social networking and transform old business segments like traditional retail. However, the DNA that is used to build data processing platforms is evolving quite rapidly. There is a plethora of competing tools, technologies, and “religion” for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks in the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this “wrong” way. Since the “right” way to build a technology framework is dependent on the key business drivers, it is my hope that this talk will spur a discussion on what is the “right” way for Pinterest. The talk will focus on technologies including “data plumbing” (e.g. tools in the Hadoop ecosystem), and statistical modeling methods (e.g. R and Python). In this talk, I’ll try to connect to platform builders, data scientists, and business decision makers.
BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards, and industry research awards. He is the recipient of the Wisconsin COW teaching award, and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix -- a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.
Every year the financial industry loses billions because of fraud while in the meantime fraudsters are coming up with more and more sophisticated patterns.
Financial institutions have to find the balance between fraud protection and negative customer experience. Fraudsters bury their patterns in lots of data, but the traditional technologies are not designed to detect fraud in real-time or to see patterns beyond the individual account.
Analyzing relations with graph databases helps uncover these larger complex patterns and speeds up suspicious behavior identification.
Furthermore, graph databases enable fast and effective real-time link queries and passing context to machine learning models.
The earlier a fraud pattern or network is identified, the faster the activity is blocked. As a result, losses and fines are minimized.
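To illustrate what "patterns beyond the individual account" can mean in practice, the sketch below links accounts that share an identifier (a phone number or a device, for example) and reports each connected group as a candidate ring. It is a minimal in-memory illustration under assumed data, not a graph database query; the account names, identifier strings and the `find_rings` helper are all hypothetical.

```python
from collections import defaultdict


def find_rings(accounts):
    """Group accounts that are linked, directly or indirectly, by shared identifiers.

    accounts: dict mapping account id -> set of identifiers (phone, device, ...).
    Returns a list of sets, one per connected group of two or more accounts.
    """
    # Index accounts by each identifier they use.
    by_identifier = defaultdict(set)
    for account, identifiers in accounts.items():
        for identifier in identifiers:
            by_identifier[identifier].add(account)

    # Two accounts are neighbours if they share at least one identifier.
    neighbours = defaultdict(set)
    for linked in by_identifier.values():
        for account in linked:
            neighbours[account] |= linked - {account}

    # Depth-first traversal to collect connected components.
    seen, rings = set(), []
    for start in accounts:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        if len(component) > 1:
            rings.append(component)
    return rings


# Hypothetical data: A and C never share an identifier, yet both link to B.
accounts = {
    "A": {"phone:1"},
    "B": {"phone:1", "device:9"},
    "C": {"device:9"},
    "D": {"phone:7"},
}
print(find_rings(accounts))
```

Looking at account A alone reveals nothing unusual; the link to C only appears once the shared identifiers are followed through B, which is the kind of relationship a per-account check misses.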
Introduction to Data Science and Analytics by Srinath Perera
This webinar serves as an introduction to WSO2 Summer School. It will discuss how to build a pipeline for your organization and for each use case, and the technology and tooling choices that need to be made for the same.
This session will explore analytics under four themes:
Hindsight (what happened)
Oversight (what is happening)
Insight (why is it happening)
Foresight (what will happen)
Recording http://t.co/WcMFEAJHok
How to Feed a Data Hungry Organization – by Traveloka Data Team (Traveloka)
In Traveloka's Inaugural Data Meetup held in April 2017, Ainun Najib (Head of Data), Dr. Philip Thomas (Lead Data Scientist), and Rendy B. Junior (Lead Data Engineer) shared about the journey that Traveloka's Data Team have taken so far so that the audience can learn from the struggles and triumphs in managing Traveloka's burgeoning data.
You will learn more about:
1) Data culture in Traveloka
2) Data engineering in Traveloka
3) Data science in Traveloka
To follow our LinkedIn page, visit bit.ly/TravelokaLinkedInPage
Safe Harbor Statement
Our discussion may include predictions, estimates or other information that might be considered conclusive. While these conclusive statements represent our current judgment on the best practices, they are subject to risks and uncertainties that could cause actual results to differ materially. You are cautioned not to place undue reliance on our statements, which reflect our opinions only as of the date of this presentation. Please keep in mind that we are not obligating ourselves to revise or publicly release the results of any revision to these presentation materials in light of new information or future events.
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
What if you could get blazing fast queries on your data without having to be on call for a giant, expensive database? By picking the right file format for your data, you can store your data on disk in the cloud and still get the performance you need for modern analytics. We'll discuss benchmarks of four different data storage formats: Parquet, ORC, Avro, and traditional character-separated files like CSV. We'll cover what they are, how they work at a bits-and-bytes level, and why you might choose each one for your use case.
Data Science with Spark - Training at SparkSummit (East) by Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
The video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
Notes about Amazon VPC, a canonical architecture and finally how to implement MongoDB replica sets. My blog http://goo.gl/0guF2 has the color pictures. And the file is at http://doubleclix.files.wordpress.com/2012/10/vpc-distilled-04.pdf. For some reason, slideshare trims the colors.
My talk on NOSQL at OGF29. [Updated with the OSCON'10 presentation!] Updates do not work reliably in slideshare, so I also keep the latest version on my blog.
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
R, Data Wrangling & Predicting NFL with Elo like Nate Silver & 538
1. Who will win XLIX?
R, Data Wrangling &
Data Science
January 18, 2015
@ksankar // doubleclix.wordpress.com
“I want to die on Mars but not on impact”
— Elon Musk, interview with Chris Anderson
“The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” -- Jerome Seymour Bruner
"There are no facts, only interpretations." - Friedrich Nietzsche
3. Goals & non-goals
Goals
¤ Get familiar with the R language & dplyr
¤ Work on a couple of interesting data science problems
¤ Give you a focused time to work
§ Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
§ The programs can be optimized
Non-goals
¡ Go deep into the algorithms
• We don’t have sufficient time. The topic can easily be a 5-day tutorial !
¡ Dive into R internals
• That is for another day
¡ A passive talk
• Nope. Interactive & hands-on
4. Activities & Results
o Activities:
• Get familiar with R, R Studio
• Work on a couple of data sets
• Get familiar with the mechanics of Data Science Competitions
• Explore the intersection of Algorithms, Data, Intelligence, Inference & Results
• Discuss Data Science Horse Sense ;o)
o Results :
• Hands-on R
• Familiar with some of the interesting algorithms
• Submitted entries for 1 competition
• Knowledge of Model Evaluation
• Cross Validation, ROC Curves
5. About Me
o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, PyData et al
o Reviewing Packt Book “Machine Learning with Spark”
o Picked up co-authorship of the Second Edition of “Fast Data Processing with Spark”
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech),
• Written Books (Web 2.0, Wireless, Java,…)
• Standards, some work in AI,
• Guest Lecturer at Naval PG School,…
• Planning MS-CFinance or Statistics
• Volunteer as Robotics Judge at FIRST LEGO League World Competitions
o @ksankar, doubleclix.wordpress.com
The Nuthead band !
6. Setup & Data
R & IDE
o Install R
o Install R Studio
Tutorial Materials
o Github : https://github.com/xsankar/hairy-octo-hipster
o Clone or download zip
Setup an account in Kaggle (www.kaggle.com)
We will be using the data from 2 Kaggle competitions plus the 2014 NFL box scores
① Titanic: Machine Learning from Disaster
Download data from http://www.kaggle.com/c/titanic-gettingStarted
Directory ~/hairy-octo-hipster/titanic-r
② Predicting Bike Sharing @ Washington DC
Download data from http://www.kaggle.com/c/bike-sharing-demand/data
Directory ~/hairy-octo-hipster/bike
③ 2014 NFL Boxscore
http://www.pro-football-reference.com/years/2014/games.htm
Directory ~/hairy-octo-hipster/nfl
Data
7. Agenda
o Jan 18 : 9:00-12:30 3 hrs
o Intro, Goals, Logistics, Setup [10] [9:00-9:10)
o Introduction to R & dplyr [30] [9:10-9:40)
o Who will win Super Bowl XLIX ? The Art of ELO Ranking [30] [9:40-10:10)
• The Algorithm
• The Data
• The Results (Compare with FiveThirtyEight)
o Anatomy of a Kaggle Competition [40] [10:10-10:50)
• Competition Mechanics
• Register, download data, create subdirectories
• Trial Run : Submit Titanic
o Break [20] [10:50-11:10)
o Algorithms for the Amateur Data Scientist [20] [11:10-11:30)
• Algorithms, Tools & frameworks in perspective
• “Folk Wisdom”
o Model Evaluation & Interpretation [30] [11:30 - 12:00)
• Confusion Matrix, ROC Graph
o Homework : The Art of a Competition – Bike Sharing
o Homework : The Art of a Competition – Walmart
8. Overload Warning … There is enough material for a week’s training … which is good & bad !
Read thru at your pace, refer, ponder & internalize
9. Close Encounters
— 1st
◦ This Tutorial
— 2nd
◦ Do More Hands-on Walkthrough
— 3rd
◦ Listen To Lectures
◦ More competitions …
11. R Syntax – A quick overview
o aString <- "A String"
o aNumber <- 12
o class(aString)
o class(aNumber)
o aVector <- c(1,2,3,4)
o class(aVector)
o aVector * 2
o sqrt(aVector)
o Packages : dplyr & tidyr
12. Data wrangling with dplyr
o dplyr – versatile package for various data operations
o We will see dplyr in use
o Resources:
• “Data Manipulation with dplyr” - Hadley Wickham’s UseR! 2014 Tutorial Slides
• http://datascience.la/hadley-wickhams-dplyr-tutorial-at-user-2014-part-1/
• Slides https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a
• Slides of Tutorial by RStudio’s Garrett Grolemund
• https://github.com/rstudio/webinars
• And the cheatsheet is available at http://www.rstudio.com/resources/cheatsheets/
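To give a feel for the kind of data operations dplyr covers, here is a minimal sketch of a typical verb chain. The `games` data frame, its column names and its values are all made up for illustration; they are not the tutorial's NFL data, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical box-score rows (teams "A".."D" and all scores are invented).
games <- data.frame(
  week   = c(1, 1, 2, 2),
  winner = c("A", "B", "A", "C"),
  loser  = c("D", "C", "C", "B"),
  pts_w  = c(36, 33, 26, 31),
  pts_l  = c(16, 20, 20, 24),
  stringsAsFactors = FALSE
)

wins <- games %>%
  mutate(margin = pts_w - pts_l) %>%    # add a derived column
  filter(margin >= 6) %>%               # keep the more decisive games
  group_by(winner) %>%                  # one group per winning team
  summarise(n_wins = n(), avg_margin = mean(margin)) %>%
  arrange(desc(n_wins))                 # most wins first

print(wins)
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline style readable.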
17. The Art of ELO Ranking & Super Bowl XLIX
o Let us look at this from several angles:
• The Algorithm
• The R program
• The Data
• The Results
• Comparing with the FiveThirtyEight Results
http://www.imdb.com/title/tt1285016/trivia?item=qt1318850
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
18.
19. The ELO Algorithm (1 of 3)
1. Basic Chess Algorithm proposed by Elo
• Arpad Emrick Elo proposed the system for Chess ranking
• $R_i^{new} = R_i^{old} + K\,(S_{ij} - \mu_{ij})$; $\mu_{ij} = \dfrac{1}{1 + 10^{-(R_i^{old} - R_j^{old})/400}}$
• K – varies depending on the match
• $S_{ij}$ = 1, ½ or 0
2. Soccer Ranking
• http://www.eloratings.net/system.html
3. NFL Ranking with adjusted factor for scores, 538 Ranking
Ref : Who is #1, Princeton University Press
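The basic update rule can be sketched in a few lines of base R. This is an illustrative sketch, not the tutorial's own program: the function names `elo_expected` and `elo_update` are made up here, and the default K of 20 is borrowed from the kFactor mentioned in the homework slide.

```r
# Expected score of team i against team j (logistic curve, base 10, scale 400).
elo_expected <- function(r_i, r_j) {
  1 / (1 + 10^(-(r_i - r_j) / 400))
}

# One rating update after a game; s_ij is 1 (i wins), 0.5 (tie) or 0 (i loses).
# Team j receives the mirror-image adjustment, so total rating is conserved.
elo_update <- function(r_i, r_j, s_ij, k = 20) {
  mu_ij <- elo_expected(r_i, r_j)
  c(i = r_i + k * (s_ij - mu_ij),
    j = r_j + k * ((1 - s_ij) - (1 - mu_ij)))
}

# Two evenly rated teams: the winner gains K/2 and the loser drops K/2.
elo_update(1500, 1500, 1)
```

Because the expected score is 0.5 for equal ratings, an upset of a much stronger team moves the ratings far more than an expected win does.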
20. The ELO Algorithm (2 of 3)
NFL Ranking
http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
27. Wisdom from Nate Silver & the 538 Gang …
o [Homework #1] Improve our core algorithm to add the Margin of victory from the 538 gang !
• Remember, kFactor = 20
o [Homework #2] Weigh recent games more heavily w/ Exponential Decay
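One possible starting point for Homework #2 is to shrink the K factor for older games. The sketch below is only an assumption about how the weighting could be done: the `decay_weight` helper and the decay rate of 0.1 per week are invented for illustration and are not taken from the slides or from FiveThirtyEight.

```r
# Hypothetical weighting: a game played `weeks_ago` weeks back counts for
# exp(-decay * weeks_ago) of a game played this week.
decay_weight <- function(weeks_ago, decay = 0.1) {
  exp(-decay * weeks_ago)
}

k_factor <- 20

# Effective K for games played this week, 5 weeks ago and 10 weeks ago.
effective_k <- k_factor * decay_weight(c(0, 5, 10))
print(effective_k)
```

The decay rate is a tuning knob; a larger value makes the ratings forget early-season games faster.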
28. The Art of ELO Ranking
& Super Bowl XLIX
o The real formula is
o Not what is written on the glass !
o But then that is Hollywood !
I need the Algorithm, I need the Algorithm
– Mark Z to Eduardo S
Ref : Who is #1, Princeton University Press
29. References:
o ELO ranking – NFL, Soccer
• http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/
• http://fivethirtyeight.com/datalab/nfl-week-20-elo-ratings-and-playoff-odds-conference-championships/
• http://www.eloratings.net/system.html
o dplyr
• http://www.rstudio.com/resources/webinars/ <- github for the slides
• http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part1/
• http://www.sharpsightlabs.com/data-analysis-example-r-supercars-part2/
• http://www.rstudio.com/resources/cheatsheets/
• http://www.r-bloggers.com/data-analysis-example-with-ggplot-and-dplyr-analyzing-supercar-data-part-2/
31. Kaggle Data Science Competitions
o Hosts Data Science Competitions
o Competition Attributes:
• Dataset
• Train
• Test (Submission)
• Final Evaluation Data Set (We don’t see)
• Rules
• Time boxed
• Leaderboard
• Evaluation function
• Discussion Forum
• Private or Public
32. Titanic Passenger Metadata
• Small
• 3 Predictors
• Class
• Sex
• Age
• Survived?
http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html
City Bike Sharing Prediction (Washington DC)
Walmart Store Forecasting
33. Train.csv
Taken from Titanic Passenger Manifest
Variable – Description
Survived – 0 = No, 1 = Yes
Pclass – Passenger Class (1st, 2nd, 3rd)
Sibsp – Number of Siblings/Spouses Aboard
Parch – Number of Parents/Children Aboard
Embarked – Port of Embarkation
o C = Cherbourg
o Q = Queenstown
o S = Southampton
35. Approach
o This is a classification problem - 0 or 1
o Comb the forums !
o Opportunity for us to try different algorithms & compare them
• Simple Model
• CART[Classification & Regression Tree]
• Greedy, top-down binary, recursive partitioning that divides feature space into sets of disjoint rectangular regions
• RandomForest
• Different parameters
• SVM
• Multiple kernels
• Table the results
o Use cross validation to predict our model performance & correlate with what Kaggle says
http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
36. Simple Model – Our First Submission
o #1 : Simple Model (M=survived)
o #2 : Simple Model (F=survived)
https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
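The two simple models need no learning at all; each is a single rule applied to the Sex column. The sketch below is not the code from 1-Intro_to_Kaggle.R. The `test` data frame is a made-up stand-in for a few rows of the Kaggle test file, and the output file name is hypothetical.

```r
# Hypothetical stand-in for a few rows of test.csv (values are made up).
test <- data.frame(
  PassengerId = 892:896,
  Sex = c("male", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# Model #1: predict that every man survived.
submission_m <- data.frame(
  PassengerId = test$PassengerId,
  Survived    = ifelse(test$Sex == "male", 1L, 0L)
)

# Model #2: predict that every woman survived.
submission_f <- data.frame(
  PassengerId = test$PassengerId,
  Survived    = ifelse(test$Sex == "female", 1L, 0L)
)

# A submission is just this two-column table written out as CSV, e.g.:
# write.csv(submission_f, "gender_model.csv", row.names = FALSE)
```

Submitting both gives a quick baseline on the leaderboard against which the later models can be compared.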
37. #3 : Simple CART Model
o CART (Classification & Regression Tree)
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees
May be better, because we have improved on the survival of men !
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
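A CART fit can be sketched with the rpart package. To stay self-contained, this sketch uses R's built-in `Titanic` contingency table rather than the Kaggle train.csv; that built-in table includes crew and records Age only as Child/Adult, so its results will not match the competition data. It is an illustration under those assumptions, not the tutorial's script.

```r
library(rpart)  # recommended package that ships with standard R installs

# Expand the built-in contingency table into one row per person.
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq),
                  c("Class", "Sex", "Age", "Survived")]

# method = "class" asks for a classification tree.
fit <- rpart(Survived ~ Class + Sex + Age, data = passengers, method = "class")

pred <- predict(fit, passengers, type = "class")
train_accuracy <- mean(pred == passengers$Survived)

print(fit)            # text view of the splits
print(train_accuracy)
```

Printing the fitted tree shows which variable the greedy top-down partitioning chose first and where it stopped.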
38. #4 : Random Forest Model
o https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
• Chris Clark http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
o https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
o https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
39. #5 : SVM
o Multiple Kernels
o kernel = ‘radial’ #Radial Basis Function
o Kernel = ‘sigmoid’
o agconti's blog - Ultimate Titanic !
o http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713
Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/hairy-octo-hipster/
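Trying multiple kernels can be sketched as a loop over kernel names. This assumes the e1071 package (which provides `svm()`) is installed, and it again uses R's built-in `Titanic` table as a stand-in for the Kaggle data, so the numbers are only illustrative.

```r
library(e1071)   # CRAN package providing svm(); assumed to be installed

# Expand the built-in contingency table into one row per person.
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq),
                  c("Class", "Sex", "Age", "Survived")]

kernels <- c("radial", "sigmoid")

# Fit one SVM per kernel and record its accuracy on the training data.
train_accuracy <- sapply(kernels, function(kern) {
  fit <- svm(Survived ~ Class + Sex + Age, data = passengers, kernel = kern)
  mean(predict(fit, passengers) == passengers$Survived)
})

print(train_accuracy)
```

Training accuracy is optimistic; comparing kernels fairly needs held-out data or cross validation.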
40. Feature Engineering - Homework
o Add attribute : Age
• In train 714/891 have age; in test 332/418 have age
• Missing values can be just Mean Age of all passengers
• We could be more precise and calculate Mean Age based on Title (Ms, Mrs, Master et al)
• Box plot age
o Add attribute : Mother, Family size et al
o Feature engineering ideas
• http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python
o More ideas at
http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/
o And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
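The two ways of filling in missing ages can be sketched in base R. The miniature `train` data frame below is invented for illustration (names and ages are made up), and the regular expression assumes names follow the "Surname, Title. Given" pattern.

```r
# Hypothetical miniature of train.csv (values made up for illustration).
train <- data.frame(
  Name = c("Smith, Mr. John", "Smith, Mrs. Jane", "Doe, Master. Tim",
           "Roe, Mr. Rick", "Poe, Mrs. Ann", "Doe, Master. Tom"),
  Age  = c(40, 30, 4, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Option 1: fill every missing age with the overall mean age.
age_overall <- ifelse(is.na(train$Age), mean(train$Age, na.rm = TRUE), train$Age)

# Option 2: fill with the mean age of passengers sharing the same title.
# The title is the text between the comma and the first period.
train$Title <- sub("^.*, *([^.]+)\\..*$", "\\1", train$Name)
title_mean  <- ave(train$Age, train$Title,
                   FUN = function(a) mean(a, na.rm = TRUE))
age_by_title <- ifelse(is.na(train$Age), title_mean, train$Age)

print(data.frame(train, age_overall, age_by_title))
```

In this toy data the overall mean would make a "Master" (a young boy) about 25 years old, while the title-based mean keeps him a child, which is the point of the more precise approach.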
41. What does it mean ? Let us ponder ….
o We have a training data set representing a domain
• We reason over the dataset & develop a model to predict outcomes
o How good is our prediction when it comes to real life scenarios ?
o The assumption is that the dataset is taken at random
• Or Is it ? Is there a Sampling Bias ?
• i.i.d ? Independent ? Identically Distributed ?
• What about homoscedasticity ? Do they have the same finite variance ?
o Can we assure that another dataset (from the same domain) will give us the same result ?
o Will our model & its parameters remain the same if we get another data set ?
o How can we evaluate our model ?
o How can we select the right parameters for a selected model ?
43. Algorithms for the
Amateur Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have …
11:10
44. Ref: Anthony’s Kaggle Presentation
Data Scientists apply different techniques
• Support Vector Machine
• adaBoost
• Bayesian Networks
• Decision Trees
• Ensemble Methods
• Random Forest
• Logistic Regression
• Genetic Algorithms
• Monte Carlo Methods
• Principal Component Analysis
• Kalman Filter
• Evolutionary Fuzzy Modelling
• Neural Networks
Quora
• http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
45. Algorithm spectrum
o Regression
o Logit
o CART
o Ensemble : Random Forest
o Clustering
o KNN
o Genetic Alg
o Simulated Annealing
o Collab Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann Machine
o Feature Learning
Spectrum labels: Machine Learning, Cute Math, Artificial Intelligence
46. Classifying Classifiers
Statistical: Regression, Naïve Bayes, Bayesian Networks
Structural: Rule-based, Distance-based, Neural Networks
• Rule-based: Production Rules, Decision Trees
• Neural Networks: Multi-layer Perceptron
• Distance-based: Functional, Nearest Neighbor
• Functional: Linear, Spectral Wavelet
• Nearest Neighbor: kNN, Learning Vector Quantization
Ensemble: Random Forests, Boosting
Logistic Regression¹
SVM
¹ Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
49. Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
50. Data Science “folk knowledge” (2 of A)
o Over fitting has many faces
• Bias – Model not strong enough. So the learner has the tendency to learn the same wrong things
• Variance – Learning too much from one dataset; model will fall apart (i.e. much less accurate) on a different dataset
• Sampling Bias
o Intuition Fails in high Dimensions – Bellman
• Blessing of non-uniformity & lower effective dimension; many applications have examples not uniformly spread but concentrated near a lower dimensional manifold, e.g. the space of digits is much smaller than the space of images
o Theoretical Guarantees are not What they seem
• One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.
o Feature engineering is the Key
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
51. Data Science “folk knowledge” (3 of A)
o More Data Beats a Cleverer Algorithm
• Or conversely select algorithms that improve with data
• Don’t optimize prematurely without getting more data
o Learn many models, not Just One
• Ensembles ! – Change the hypothesis space
• Netflix prize
• E.g. Bagging, Boosting, Stacking
o Simplicity Does not necessarily imply Accuracy
o Representable Does not imply Learnable
• Just because a function can be represented does not mean it can be learned
o Correlation Does not imply Causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
§ http://dl.acm.org/citation.cfm?id=2347755
52. Data Science “folk knowledge” (4 of A)
o The simplest hypothesis that fits the data is also the most plausible
• Occam’s Razor
• Don’t go for a 4 layer Neural Network unless you have that complex data
• But that doesn’t also mean that one should choose the simplest hypothesis
• Match the impedance of the domain, data & the algorithms
o Think of over fitting as memorizing as opposed to learning.
o Data leakage has many forms
o Sometimes the Absence of Something is Everything
o [Corollary] Absence of Evidence is not the Evidence of Absence
New to Machine Learning? Avoid these three mistakes, James Faghmous
https://medium.com/about-data/73258b3848a4
§ Simple Model
§ High Error line that cannot be compensated with more data
§ Gets to a lower error rate with less data points
§ Complex Model
§ Lower Error Line
§ But needs more data points to reach decent error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
53. Importance of feature selection & weak models
o “Good features allow a simple model to beat a complex model” - Ben Lorica¹
o “… using many weak predictors will always be more accurate than using a few strong ones …” – Vladimir Vapnik²
o “A good decision rule is not a simple one, it cannot be described by a very few parameters”²
o “Machine learning science is not only about computers, but about humans, and the unity of logic, emotion, and culture.”²
o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you” – Hadley Wickham³
¹ http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html
² http://nautil.us/issue/6/secret-codes/teaching-me-softly
³ http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/
Updated Slide
54. Check your assumptions
o The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data
o For example, for regression one should check that:
① Variables are normally distributed
• Test for normality via visual inspection, skew & kurtosis, outlier inspections via plots, z-scores et al
② There is a linear relationship between the dependent & independent variables
• Inspect residual plots, try quadratic relationships, try log plots et al
③ Variables are measured without error
④ Assumption of Homoscedasticity
§ Homoscedasticity assumes constant or near constant error variance
§ Check the standard residual plots and look for heteroscedasticity
§ For example in the figure, left box has the errors scattered randomly around zero; while the right two diagrams have the errors unevenly distributed
Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, http://pareonline.net/getvn.asp?v=8&n=2
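A few of these checks can be sketched in base R. The built-in `mtcars` data and the `mpg ~ wt` model are stand-ins chosen only so the example runs anywhere; they are not part of the tutorial, and the Shapiro-Wilk test is one of several possible normality checks.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # built-in dataset, used only as a stand-in
res <- residuals(fit)

# (1) Normality: Shapiro-Wilk test on a variable, plus a Q-Q plot of residuals.
normality <- shapiro.test(mtcars$mpg)
qqnorm(res)
qqline(res)

# (2) Linearity and (4) homoscedasticity: residuals vs fitted values should
# look like an even band around zero, with no curve and no funnel shape.
plot(fitted(fit), res, xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)

# (2) Try a quadratic relationship and compare the two nested fits.
fit2 <- lm(mpg ~ wt + I(wt^2), data = mtcars)
comparison <- anova(fit, fit2)

print(normality)
print(comparison)
```

A curved pattern in the residual plot argues for the quadratic term; a funnel shape points to heteroscedasticity.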
55. Data Science “folk knowledge” (5 of A)
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
Axes: The World (Knowns, Unknowns) vs. You (Unknown, Known)
o Others know, you don’t
o What we do
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware, but not with certainty
o Stochastic processes, Probabilities
o Known Knowns
o There are things we know that we know
o Known Unknowns
o That is to say, there are things that we now know we don't know
o But there are also Unknown Unknowns
o There are things we do not know we don't know
56. Data Science “folk knowledge” (6 of A) - Pipeline
Pipeline stages: Collect, Store, Transform (Data Management); Reason, Model, Deploy (Data Science); Explore, Visualize, Recommend, Predict
o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data subsets
§ Engineered Attribute sets
o Validation run across a larger data set
o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
57. Data Science “folk knowledge” (7 of A)
Volume, Velocity, Variety
Context, Connectedness
Intelligence, Interface, Inference
“Data of unusual size” that can't be brute forced
o Three Amigos
o Interface = Cognition
o Intelligence = Compute(CPU) & Computational(GPU)
o Infer Significance & Causality
58. Data Science “folk knowledge” (8 of A)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel Format, Perl, Perl Book
o Get your head around data
• Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
59. Data Science “folk knowledge” (9 of A)
① Common Sense (some features make more sense than others)
② Carefully read these forums to get a peek at other people’s mindset
③ Visualizations
④ Train a classifier (e.g. logistic regression) and look at the feature weights
⑤ Train a decision tree and visualize it
⑥ Cluster the data and look at what clusters you get out
⑦ Just look at the raw data
⑧ Train a simple classifier, see what mistakes it makes
⑨ Write a classifier using handwritten rules
⑩ Pick a fancy method that you want to apply (Deep Learning/Nnet)
-- Maarten Bosma
-- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data
60. Data Science “folk knowledge” (A of A)
Lessons from Kaggle Winners
① Don’t over-fit
② All predictors are not needed
• All data rows are not needed, either
③ Tuning the algorithms will give different results
④ Reduce the dataset (Average, select transition data,…)
⑤ Test set & training set can differ
⑥ Iteratively explore & get your head around data
⑦ Don’t be afraid to submit simple solutions
⑧ Keep a tab & history of your submissions
61. The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician – Josh Wills (Cloudera)
Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer – Will Cukierski (Kaggle)
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
62. Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing Benjamini, Y. and Hochberg, Y. C
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmous
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
63. For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
65. Bias/Variance (1 of 2)
o Model Complexity
• Complex Model increases the training data fit
• But then it overfits & doesn't perform as well with real data
o Bias vs. Variance
o Classical diagram
o From ESLII, By Hastie, Tibshirani & Friedman
o Bias – Model learns wrong things; not complex enough; error gap small; more data by itself won’t help
o Variance – Different dataset will give different error rate; over fitted model; larger error gap; more data could help
Learning Curve (figure): Prediction Error vs. Training Error
Ref: Andrew Ng/Stanford, Yaser S./CalTech
66. Bias/Variance (2 of 2)
o High Bias
• Due to Underfitting
• Add more features
• More sophisticated model
• Quadratic Terms, complex equations,…
• Decrease regularization
o High Variance
• Due to Overfitting
• Use fewer features
• Use more training sample
• Increase Regularization
Learning Curve (figure): Prediction Error vs. Training Error
Ref: Strata 2013 Tutorial by Olivier Grisel
Need more features or more complex model to improve
Need more data to improve
'Bias is a learner’s tendency to consistently learn the same wrong thing.' -- Pedro Domingos
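The underfitting versus overfitting contrast can be seen on synthetic data by comparing training and test error for a simple and a complex model. Everything in this sketch is an assumption made for illustration: the sine-shaped data, the noise level, the 40/20 split and the polynomial degree of 10.

```r
set.seed(42)
n <- 60
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)   # synthetic data, for illustration only
d <- data.frame(x = x, y = y)
train <- d[1:40, ]
test  <- d[41:60, ]

# Mean squared error of a fitted model on a given data set.
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

simple_fit  <- lm(y ~ x, data = train)             # a straight line: high bias
complex_fit <- lm(y ~ poly(x, 10), data = train)   # degree 10: prone to high variance

errors <- rbind(
  simple  = c(train = mse(simple_fit, train),  test = mse(simple_fit, test)),
  complex = c(train = mse(complex_fit, train), test = mse(complex_fit, test))
)
print(errors)
```

The complex model always fits the training data at least as well as the simple one; whether it also does better on the test rows is exactly the bias/variance question, and the gap between the two columns is the thing to watch.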
67. Partition Data !
• Training (60%)
• Validation(20%) &
• “Vault” Test (20%) Data sets
k-fold Cross-Validation
• Split data into k equal parts
• Fit model to k-1 parts & calculate prediction error on kth part
• Non-overlapping dataset
Data Partition & Cross-Validation
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
[Figure: data split into Train / Validate / Test; K-fold CV (k=5), where each of folds #1 to #5 is held out for validation in turn while the other four are used for training]
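The k-fold procedure can be written out by hand in base R. The `mtcars` data and the `mpg ~ wt + hp` model are stand-ins picked so the sketch is self-contained; they are not the tutorial's data.

```r
set.seed(1)
k <- 5
data <- mtcars   # built-in stand-in dataset

# Assign every row to exactly one of k folds, in random order.
fold <- sample(rep(1:k, length.out = nrow(data)))

# For each fold: fit on the other k-1 folds, measure error on the held-out fold.
cv_mse <- sapply(1:k, function(i) {
  train    <- data[fold != i, ]
  validate <- data[fold == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  mean((validate$mpg - predict(fit, newdata = validate))^2)
})

print(cv_mse)
print(mean(cv_mse))   # the cross-validated estimate of prediction error
```

Because the folds do not overlap, every row is used for validation exactly once and for training k-1 times.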
68. Bootstrap
• Draw datasets (with replacement) and fit model for each dataset
• Remember : Data Partitioning (#1) & Cross Validation (#2) are without replacement
Bootstrap & Bagging
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
Bagging (Bootstrap aggregation)
◦ Average prediction over a collection of bootstrapped samples, thus reducing variance
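Bootstrap sampling and bagging can be sketched in base R. The `mtcars` data, the linear model, the 200 resamples and the `new_car` row are all assumptions chosen for a self-contained illustration; bagging is usually paired with less stable learners such as trees, where the variance reduction is larger.

```r
set.seed(7)
data <- mtcars
n_boot <- 200
new_car <- data.frame(wt = 3, hp = 150)   # hypothetical car to predict for

boot_pred <- replicate(n_boot, {
  idx <- sample(nrow(data), replace = TRUE)   # bootstrap: draw WITH replacement
  fit <- lm(mpg ~ wt + hp, data = data[idx, ])
  unname(predict(fit, newdata = new_car))
})

bagged_prediction <- mean(boot_pred)   # bagging: average over the bootstrap fits
spread <- sd(boot_pred)                # how much a single fit varies by sample

print(bagged_prediction)
print(spread)
```

The spread of the individual predictions shows how much a single model depends on the sample it saw; averaging them is what smooths that dependence out.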
69. Boosting
◦ “Output of weak classifiers into a powerful committee”
◦ Final Prediction = weighted majority vote
◦ Later classifiers get misclassified points
– With higher weight,
– So they are forced
– To concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs Bagging
– Bagging – independent trees
– Boosting – successively weighted
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
70. Random Forests+
◦ Builds large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d* random variables for splitting
◦ Simpler to train & tune
◦ “Do remarkably well, with very little tuning required” – ESLII
◦ Less susceptible to over fitting (than boosting)
◦ Many RF implementations
– Original version - Fortran-77 ! By Breiman/Cutler
– Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
71. Ensemble Methods
◦ Two Step
– Develop a set of learners
– Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of:
– Using different algorithms,
– Using the same algorithm with different settings
– Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
— Goal
◦ Model Complexity (-)
◦ Variance (-)
◦ Prediction Accuracy (+)
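The "combine the results" step can be as simple as an unweighted majority vote. The sketch below assumes each learner is a plain callable returning a class label; the three threshold rules are hypothetical stand-ins for "the same algorithm with different settings".

```python
from collections import Counter

def majority_vote(learners, x):
    """Composite predictor: the label predicted by most learners."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

# Hypothetical learners: one rule with three different threshold settings.
learners = [lambda x: x > 3, lambda x: x > 5, lambda x: x > 7]
```

With an odd number of binary learners there are no ties; otherwise a tie-breaking rule would have to be chosen.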
72. Random Forests
o While Boosting splits based on best among all variables, RF splits based on best among randomly chosen variables
o Simpler because it requires only two parameters – no. of predictors tried at each split (typically √k, where k is the total no. of predictors) & no. of trees (500 for a large dataset, 150 for a smaller one)
o Error prediction
• For each iteration, predict for the data that is not in the sample (OOB, i.e. out-of-bag, data)
• Aggregate OOB predictions
• Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
• Can use this to search for the optimal # of predictors
• We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
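The OOB error steps listed above can be sketched as follows. To stay short this uses a toy nearest-centroid learner on one numeric feature instead of a tree, and it omits the random selection of √k predictors, so it is bagging with an OOB estimate rather than a full random forest. All names are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def fit_centroids(sample):
    """Toy base learner: per-class mean of a single numeric feature."""
    sums, counts = defaultdict(float), Counter()
    for x, y in sample:
        sums[y] += x
        counts[y] += 1
    return {label: sums[label] / counts[label] for label in counts}

def predict_centroids(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def oob_error(data, n_models=25, seed=0):
    """OOB estimate: score each point only with models that never saw it."""
    rng = random.Random(seed)
    n = len(data)
    oob_votes = defaultdict(Counter)  # point index -> votes from out-of-bag models
    for _ in range(n_models):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of indices
        model = fit_centroids([data[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):  # points not in this sample
            oob_votes[i][predict_centroids(model, data[i][0])] += 1
    if not oob_votes:
        raise ValueError("no out-of-bag points; use more models or more data")
    # Aggregate the OOB votes per point and compare with the true label.
    wrong = sum(1 for i, votes in oob_votes.items()
                if votes.most_common(1)[0][0] != data[i][1])
    return wrong / len(oob_votes)
```

Repeating the estimate for different numbers of candidate predictors is one way to carry out the search mentioned in the list.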
76. Model Evaluation - Accuracy
o Accuracy = $\frac{tp + tn}{tp + fp + fn + tn}$
o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate !
o Hence the F-measure is a better reflection of the model strength
         | Predicted=1           | Predicted=0
Actual=1 | True+ (tp)            | False- (fn) – Type II
Actual=0 | False+ (fp) – Type I  | True- (tn)
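A short sketch of the degenerate case described above, using a made-up dataset with 5 positives and 95 negatives. The helper names are assumptions for the example.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical imbalanced data: 5 positives, 95 negatives.
actual = [1] * 5 + [0] * 95
always_false = [0] * len(actual)  # the degenerate "return(false)" classifier
tp, fp, fn, tn = confusion_counts(actual, always_false)
print(accuracy(tp, fp, fn, tn))  # 0.95, although no positive is ever found (tp == 0)
```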
77. Model Evaluation – Precision & Recall
o Precision = How many of the items we identified are relevant
o Recall = How many of the relevant items we identified
o Inverse relationship – Tradeoff depends on the situation
• Legal – Coverage is more important than correctness
• Search – Accuracy is more important
• Fraud – Support cost (high fp) vs. wrath of credit card co. (high fn)
o Definitions
• Precision (also called Accuracy, Relevancy) = $\frac{tp}{tp + fp}$
• Recall (also called True +ve Rate, Coverage, Sensitivity, Hit Rate) = $\frac{tp}{tp + fn}$
• Type 1 Error Rate (also called False +ve Rate, False Alarm Rate) = $\frac{fp}{fp + tn}$
• Specificity = 1 – fp rate
• Type 1 Error = fp
• Type 2 Error = fn
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
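The ratios on this slide translate directly into code. The sketch below is illustrative only; the function name and the example counts are made up, and empty denominators are mapped to 0.0 as a convention chosen for the example.

```python
def rates(tp, fp, fn, tn):
    """Precision, recall, false-positive rate and specificity from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fp_rate": fp_rate,
        "specificity": 1 - fp_rate,
    }

# Hypothetical counts: 40 true positives, 10 false alarms, 20 misses, 30 true negatives.
example = rates(tp=40, fp=10, fn=20, tn=30)
```

Lowering the decision threshold typically moves misses (fn) into hits (tp) at the price of more false alarms (fp), which is the tradeoff the slide describes.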
79. Model Evaluation : F-Measure
Precision = tp / (tp+fp) : Recall = tp / (tp+fn)
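F-Measure
Balanced, Combined, Weighted Harmonic Mean, measures effectiveness
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \text{where } \beta^2 = \frac{1 - \alpha}{\alpha}$$
Common Form (Balanced F1) : β=1 (α = ½) ; F1 = 2PR / (P+R)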
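As a quick check of the F-measure, here is a sketch of both the β form and the α form (weighted harmonic mean). The function names are assumptions for the example; the two forms agree when α = 1 / (β² + 1).

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure in its beta form; beta=1 gives the balanced F1 = 2PR / (P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def f_measure_alpha(precision, recall, alpha=0.5):
    """The same quantity written as a weighted harmonic mean (needs P > 0 and R > 0)."""
    return 1 / (alpha / precision + (1 - alpha) / recall)
```

For the degenerate classifier that never predicts a positive, precision and recall are both 0, so F is 0 even when accuracy is high.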
80. Hands-on Walkthru - Model Evaluation
Train : 712 (80%), Test : 179, Total : 891
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf - model eval
Kappa measure is interesting
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
81. ROC Analysis
o “How good is my model?”
o Good Reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
o “A receiver operating characteristics (ROC) graph is a technique for visualizing,
organizing and selecting classifiers based on their performance”
o Much better than evaluating a model based on simple classification accuracy
o Plots tp rate vs. fp rate
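A ROC graph can be built by sweeping a decision threshold over the classifier's scores and recording (fp rate, tp rate) at each step. The sketch below assumes 0/1 labels with at least one example of each class; the function names and the six example scores are made up for illustration.

```python
def roc_points(labels, scores):
    """Return (fp_rate, tp_rate) points, from the strictest threshold to the loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: everything predicted NO
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all examples sharing this score are counted.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores for six examples (1 = actual positive).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
```

The first point (0, 0) corresponds to predicting NO for everything and the last point (1, 1) to predicting YES for everything; a model that ranks at random stays near the diagonal, with an area of about 0.5.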
82. ROC Graph - Discussion
o E = Conservative, Everything NO
o H = Liberal, Everything YES
o Am not making any political statement !
o F = Ideal
o G = Worst
o The diagonal is chance (random guessing)
o North West Corner is good
o South-East is bad
• For example E
• Believe it or Not - I have actually seen a graph with the curve in this region !
[Figure: ROC graph marking points E, F, G and H, shown alongside the confusion matrix (tp, fn, fp, tn)]
83. ROC Graph – Clinical Example
IFCC : Measures of diagnostic accuracy: basic definitions
84. ROC Graph Walk thru
Refer to 2-Model_Evaluation.R at https://github.com/xsankar/hairy-octo-hipster/
86. References:
o An Introduction to scikit-learn, pycon 2013, Jake Vanderplas
• http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, pycon 2013, Strata 2014, Olivier Grisel
• http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
• http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
• http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
89. Few interesting Links - Comb the forums
o Quick First prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing
• Solution by Brandon Harris
o Random forest http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction
o GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm
o Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare
o Ggplot http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances
o Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour
o Casual & Registered Users : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count
o RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r
o Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data
o Date Error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402
o Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html
90. Data Organization – train, test & submission
• datetime - hourly date + timestamp
• season
• 1 = spring, 2 = summer, 3 = fall, 4 = winter
• holiday - whether the day is considered a holiday
• workingday - whether the day is neither a weekend nor holiday
• weather
• 1: Clear, Few clouds, Partly cloudy
• 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
• 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
• 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
• temp - temperature in Celsius
• atemp - "feels like" temperature in Celsius
• humidity - relative humidity
• windspeed - wind speed
• casual - number of non-registered user rentals initiated
• registered - number of registered user rentals initiated
• count - number of total rentals
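A small sketch of how one row with these fields might be turned into model features, including the datetime-to-hour conversion. Two assumptions are made here and are not stated on the slide: that the timestamp looks like `2011-01-01 05:00:00`, and that count equals casual + registered. The row values are invented.

```python
from datetime import datetime

SEASONS = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}

def parse_row(row):
    """Derive a few features from one raw record (all values arrive as strings)."""
    # Assumed timestamp format: YYYY-MM-DD HH:MM:SS
    ts = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,
        "month": ts.month,
        "season": SEASONS[int(row["season"])],
        # Assumption: total rentals = casual + registered
        "count": int(row["casual"]) + int(row["registered"]),
    }

# Hypothetical record for illustration.
row = {"datetime": "2011-01-01 05:00:00", "season": "1", "casual": "3", "registered": "13"}
features = parse_row(row)
```

The casual and registered columns are only useful on the training side, since they are exactly what count is built from.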